PixelTransformer: Sample Conditioned Signal Generation
this joint distribution directly, we observe that it can be further factorized as a product of conditional distributions using the chain rule:

p(v_{g_1}, v_{g_2}, \ldots, v_{g_N} \mid S_0) = \prod_{n} p(v_{g_n} \mid S_0, v_{g_1}, \ldots, v_{g_{n-1}})

Denoting by S_n ≡ S_0 ∪ {v_{g_j}}_{j=1}^{n}, we obtain:

p(I \mid S_0) = \prod_{n} p(v_{g_n} \mid S_{n-1})    (1)

Sample Conditioned Value Prediction. The key observation from Eq. 1 is that all the factors are in the form of p(v_x | S). That is, the only queries we need to answer are: 'given some observed samples S, what is the distribution of possible values at location x?' To learn a sample conditioned generative model for images, we therefore propose to learn a function f_θ to infer p(v_x | S) for arbitrary inputs x and S. Concretely, we formulate our task as that of learning a function f_θ(x, {(x_k, v_k)}) that can predict the value distribution at an arbitrary query location x given a set of arbitrary sample (position, value) pairs {(x_k, v_k)}.

In summary:

• The task of inferring p(I | S_0) can be reduced to queries of the form p(v_x | S).

• We propose to learn a function f_θ(x, {(x_k, v_k)}) that can predict p(v_x | {v_{x_k}}) for arbitrary inputs.

While we used images as a motivating example, our formulation is also applicable for modeling distributions of other dense spatially varying signals. For RGB images, x ∈ R^2, v ∈ R^3, but other spatial signals e.g. polynomials (x ∈ R^1, v ∈ R^1), 3D shapes represented as Signed Distance Fields (x ∈ R^3, v ∈ R^1), or videos (x ∈ R^3, v ∈ R^3) can also be handled by learning f_θ(x, {(x_k, v_k)}) of the corresponding form (see Section 6).
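To make this interface concrete, the sketch below evaluates the chain-rule factorization of Eq. 1 for a simple 1D signal. Everything model-specific is a placeholder: toy_f_theta is a nearest-neighbour stand-in rather than the learned transformer, and each conditional is simplified to a single Gaussian instead of the value parametrization the model actually predicts.

```python
import math

def chain_rule_log_likelihood(f_theta, S0, targets):
    """Evaluate Eq. 1: sum of per-location conditional log-likelihoods,
    where each query conditions on S_0 plus all previously incorporated values.
    f_theta(x, samples) is assumed to return (mean, std) of p(v_x | S)."""
    samples = list(S0)                      # S_0
    total = 0.0
    for (x, v) in targets:                  # any ordering g_1, ..., g_N of target locations
        mean, std = f_theta(x, samples)     # parameters of p(v_{g_n} | S_{n-1})
        total += -0.5 * ((v - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))
        samples.append((x, v))              # S_n = S_{n-1} with (g_n, v_{g_n}) added
    return total

# Stand-in predictor: copy the value of the nearest observed sample, fixed uncertainty.
def toy_f_theta(x, samples):
    _, v = min(samples, key=lambda s: abs(s[0] - x))
    return v, 0.1

S0 = [(0.0, 0.2), (1.0, 0.8)]               # observed (position, value) pairs of a 1D signal
targets = [(0.25, 0.35), (0.5, 0.5), (0.75, 0.65)]
print(chain_rule_log_likelihood(toy_f_theta, S0, targets))
```

Because f_θ accepts an arbitrary conditioning set S and an arbitrary query x, the same call also answers one-off questions (e.g. the value distribution at one location given a single observed sample elsewhere), and replacing the evaluation step with sampling turns the loop into a generation procedure.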
3. Related Work

Autoregressive Generative Models. Closely related to our work, autoregressive generative modeling approaches also factorize the joint distribution into per-location conditional distributions. Seminal works such as Wavenet (van den Oord et al., 2016a), PixelRNN (van den Oord et al., 2016c) and PixelCNN (van den Oord et al., 2016b) showed that we can learn the distribution over the values of the 'next' timestep/pixel given the values of the previous ones, and thereby learn a generative model for the corresponding domain (speech/images). Subsequent approaches have further improved over these works by modifying the parametrization (Salimans et al., 2017), incorporating hierarchy (van den Oord et al., 2017; Razavi et al., 2019), or (similar to ours) foregoing convolutions in favor of alternate base architectures (Chen et al., 2020; Parmar et al., 2018) such as Transformers (Vaswani et al., 2017). While this line of work has led to impressive results, the core distribution modeled is that of the 'next' value given 'previous' values. More formally, while we aim to predict p(v_x | S) for arbitrary x, S, the prior autoregressive generative models only infer this for cases where S contains pixels in some sequential (e.g. raster) order and x is the immediate 'next' position. Although using masked convolutions can allow handling many possible inference orders (Jain et al., 2020), the limited receptive field of convolutions still limits such orders to locally continuous sequences. Our work can therefore be viewed as a generalization of previous 'sequential' autoregressive models in two ways: a) allowing any query position x, and b) handling arbitrary samples S for conditioning. This allows us to answer questions that prior autoregressive models cannot e.g. 'if the top-left pixel is blue, how likely is the bottom-right one to be green?', 'what is the mean image given some observations?', or 'given values of 10 specific pixels, sample likely images'.

Implicit Neural Representations. There has been a growing interest in learning neural networks to represent 3D textured scenes (Sitzmann et al., 2019), radiance fields (Mildenhall et al., 2020; Martin-Brualla et al., 2021; Zhang et al., 2020) or more generic spatial signals (Sitzmann et al., 2020; Tancik et al., 2020). The overall approach across these methods is to represent the underlying signal by learning a function g_φ that maps query positions x to corresponding values v (e.g. pixel location to intensity). Our learned f_θ(·, {(x_k, v_k)}) can similarly be thought of as mapping query positions to a corresponding value (distribution), while being conditioned on some sample values. A key difference, however, is the ability to generalize – the above-mentioned approaches learn an independent network per instance e.g. a separate g_φ is used to model each scene, therefore requiring from thousands to millions of samples to fit g_φ for a specific scene. In contrast, our approach uses a common f_θ across all instances and can therefore generalize to unseen ones given only a sparse set of samples. Although some recent approaches (Xu et al., 2019; Park et al., 2019; Mescheder et al., 2019) have shown similar ability to generalize and infer novel 3D shapes/scenes given input image(s), these cannot handle sparse input samples and do not allow inferring a distribution over the output space.

Latent Variable based Generative Models. Our approach, similar to sequential autoregressive models, factorizes the image distribution as products of per-pixel distributions. An alternate approach to generative modeling, however, is to transform a prior distribution over latent variables to the output distribution via a learned decoder. Several approaches allow learning such a decoder by leveraging diverse objectives e.g. adversarial loss (Goodfellow et al.,
Figure 3. Inferred Mean Images. We visualize the mean image predicted by our learned model on random instances of the Cat Faces
dataset. Top row: ground-truth image. Rows 2-8: Predictions using increasing number of observed pixels |S|.
account not just the initial samples S_0, but also the subsequent n − 1 samples (hence the difference from ω_n in Eq. 4). v′_n represents a value then sampled for the pixel g_n from the distribution parametrized by ω_n.

Randomized Sampling Order. While we sample the values one pixel at a time, the ordering of pixels g_1, ..., g_N need not correspond to anything specific e.g. it is not necessary that g_1 should be the top-left pixel and g_N be the bottom-right one. In fact, as our model f_θ is trained using arbitrary sets of samples S, using a structured sampling ordering e.g. raster order would make the testing setup differ from training. Instead, for every sample I ∼ p(I|S) that we draw, we use a new random order in which the pixels of the image grid are sampled.

Sidestepping Memory Bottlenecks. As Eq. 5 indicates, the input to f_θ when sampling the (n + 1)-th pixel is a set of size K + n – the initial K observations and the subsequent n samples. Unfortunately, our model's memory requirement, due to the self-attention modules, grows quadratically with this input size. This makes it infeasible to autoregressively sample a very large number of pixels. However, we empirically observe that given a sufficient number of (random) samples, subsequent pixel value distributions do not exhibit a high variance. We leverage this observation to design a hybrid sampling strategy. When generating an image with N pixels, we sample the first N′ (typically 2048) autoregressively i.e. following Eq. 5 and Eq. 6. For the remaining N − N′ pixels, we simply use their mean value estimate conditioned on the initial and generated K + N′ samples (using Eq. 4). While this may lead to some loss in detail, we qualitatively show that the effects are not prohibitive and that the sample diversity is preserved.
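The randomized ordering and the hybrid strategy can be summarized in the short sketch below. It is an illustration only: toy_f_theta is a nearest-neighbour stand-in for the learned model, the per-pixel conditional is simplified to a single Gaussian rather than the parametrization of Eq. 4, and n_auto plays the role of N′.

```python
import random

def hybrid_sample(f_theta, S0, pixel_grid, n_auto=2048, rng=random):
    """Hybrid generation: sample the first n_auto pixels autoregressively,
    then fill the remaining pixels with their conditional mean estimate.
    f_theta(x, samples) is expected to return (mean, std) of p(v_x | S)."""
    order = [x for x in pixel_grid if x not in dict(S0)]
    rng.shuffle(order)                       # a fresh random ordering for every drawn image

    samples = list(S0)                       # running set: K initial + generated samples
    image = dict(S0)

    # Autoregressive phase: each pixel conditions on all previously drawn values.
    for x in order[:n_auto]:
        mean, std = f_theta(x, samples)
        v = rng.gauss(mean, std)
        samples.append((x, v))
        image[x] = v

    # Mean-value phase: one conditional query per remaining pixel; S stops growing here.
    for x in order[n_auto:]:
        mean, _ = f_theta(x, samples)
        image[x] = mean

    return image

# Toy usage with a stand-in predictor (nearest observed value, fixed uncertainty).
def toy_f_theta(x, samples):
    (_, v) = min(samples, key=lambda s: (s[0][0] - x[0]) ** 2 + (s[0][1] - x[1]) ** 2)
    return v, 0.05

S0 = [((0, 0), 0.9), ((7, 7), 0.1)]
grid = [(i, j) for i in range(8) for j in range(8)]
img = hybrid_sample(toy_f_theta, S0, grid, n_auto=16)
```

The point of the second phase is that the conditioning set stops growing after n_auto steps, so the per-query attention cost of the remaining pixels stays bounded while, empirically, little sample diversity is lost.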
5. Experiments

To qualitatively and quantitatively demonstrate the efficacy of our approach, we consider the task of generating images given a set of pixels with known values. The goal of our experiments is twofold – a) to validate that our predictions account for the observed pixels, and b) to show that the generated samples are diverse and plausible.

Datasets. We examine our approach on three different image datasets – CIFAR10 (Krizhevsky, 2009), MNIST (LeCun et al., 1998), and the Cat Faces (Wu et al., 2020) dataset, while using the standard image splits. Note that we only require the images for training – class or attribute labels are not leveraged for learning our models i.e. even on CIFAR10, we learn a class-agnostic generative model.

Training Setup. We vary the number of observed pixels S randomly between 4 and 2048 (with uniform sampling in log-scale), while the number of query samples Q is set to 2048. During training, the locations x are treated as varying over a continuous domain, using bilinear sampling to obtain the corresponding value – this helps our implementation be agnostic to the image resolution in the dataset. While we train a separate network f_θ for each dataset, we use the exact same model, hyper-parameters etc. across them.
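The sampling scheme above can be sketched as follows. The ranges (4 to 2048 observed pixels, log-uniform, with 2048 queries) follow the text; the helper names, the H x W x C array layout, and the normalization of positions to [0, 1]^2 are assumptions made for illustration.

```python
import math
import random

import numpy as np

def bilinear_value(image, x, y):
    """Read an H x W x C image at a continuous location (x, y) in [0, 1]^2."""
    h, w = image.shape[:2]
    fx, fy = x * (w - 1), y * (h - 1)
    x0, y0 = int(fx), int(fy)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = fx - x0, fy - y0
    top = (1 - ax) * image[y0, x0] + ax * image[y0, x1]
    bot = (1 - ax) * image[y1, x0] + ax * image[y1, x1]
    return (1 - ay) * top + ay * bot

def make_training_example(image, min_obs=4, max_obs=2048, n_query=2048, rng=random):
    """Draw |S| log-uniformly in [min_obs, max_obs], then build observed and query
    (position, value) pairs at continuous locations, with values read off bilinearly."""
    n_obs = int(round(math.exp(rng.uniform(math.log(min_obs), math.log(max_obs)))))
    def draw(n):
        pts = [(rng.random(), rng.random()) for _ in range(n)]
        return [((x, y), bilinear_value(image, x, y)) for (x, y) in pts]
    return draw(n_obs), draw(n_query)

# Toy usage on a random 32 x 32 RGB array standing in for a dataset image.
img = np.random.rand(32, 32, 3)
observed, queries = make_training_example(img, n_query=8)
```

Because positions are continuous and values are interpolated, the same pipeline applies regardless of the native resolution of the dataset images.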
Qualitative Results: Mean Image Prediction. We first examine the expected image Ī inferred by our model given some samples S. We visualize in Figure 3 our predictions on the Cat Faces dataset using a varying number of input samples. We observe that even when using as few as 4 pixels in S, our model predicts a cat-like mean image that, with some exceptions, captures the coarse color accurately.
Figure 4. Image Samples. Sample images generated by our learned model on three datasets (left: MNIST, middle: Cat Faces, right:
CIFAR10) given |S| = 32 observed pixels. Top row: ground-truth image from which S is drawn. Row 2: A nearest neighbor visualization
of S – for each image pixel we assign it the color of the closest observed sample in S. Rows 3-5: Randomly sampled images from p(I|S).
[Figure: SSIM (left) and classification accuracy (right) plotted against the number of observed pixels |S|. SSIM panel legend: Decoder + Optimization, Ours (Mean Image), Ours (Image Samples). Accuracy panel legend: GT Image, Mean Image, Image Samples.]
Figure 8. Shape Generation. Sample 3D shapes generated given |S| = 32 observed SDF values at random locations. Top row: ground-
truth 3D shape. Row 2: A visualization of S – a sphere is centred at each position with color indicating value (red implies higher SDF).
Rows 3-5: Randomly sampled 3D shapes from our predicted conditional distribution.
Interestingly, we see that even when using images generated from as few as 16 pixels, we obtain about a 30% classification accuracy (or over 60% with 128 pixels). As we observe more pixels, the accuracy matches that of using the ground-truth images. Finally, we see that using the sampled images yields better results compared to the mean image, as the sampled ones look more 'real'.

6. Beyond Images: 1D and 3D Signals

While we leveraged our proposed framework for generating images given some pixel observations, our formulation is applicable beyond images. In particular, assuming the availability of (unlabeled) examples, our approach can learn to generate any dense spatial signal given some (position, value) samples. In this section, we empirically demonstrate this by learning to generate 1D (polynomial) and 3D (shapes and videos) signals using our framework.

We would like to emphasize that across these settings, where we are learning to generate rather different spatial signals, we use the same training objective and model design. That is, except for the dimensionality of input/output layers and distribution parametrization to handle the corresponding
Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, K. and Malik, J. Implicit maximum likelihood estimation. arXiv preprint arXiv:1809.09087, 2018.

Martin-Brualla, R., Radwan, N., Sajjadi, M. S. M., Barron, J. T., Dosovitskiy, A., and Duckworth, D. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, 2021.

Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., and Geiger, A. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. Implicit neural representations with periodic activation functions. In NeurIPS, 2020.

Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. In NeurIPS, 2020.

Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In CVPR, 2018.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders. In NeurIPS, 2016b.

van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016c.

van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In NeurIPS, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.

Vondrick, C., Pirsiavash, H., and Torralba, A. Generating videos with scene dynamics. In NeurIPS, 2016.

Wu, S., Rupprecht, C., and Vedaldi, A. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In CVPR, 2020.

Xu, Q., Wang, W., Ceylan, D., Mech, R., and Neumann, U. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, 2019.

Zhang, K., Riegler, G., Snavely, N., and Koltun, V. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.

Appendix

Log-likelihood under Value Distribution. The predicted value distribution for a query position x is of the form p(v; ω), where ω ≡ {(q^b, µ^b, σ^b)}_{b=1}^{B}. We reiterate q^b ∈ R^1 is the probability of assignment to bin b, c^b + µ^b is the mean of the corresponding gaussian distribution with uniform variance σ^b ∈ R^1.

Under this parametrization, we compute the log-likelihood of a value v^* by finding the closest bin b^*, and computing the log-likelihood of assignment to this bin as well as the log-probability of the value under the corresponding gaussian. We additionally use a weight α = 0.1 to balance the classification and gaussian log-likelihood terms.

b^* = \operatorname{argmin}_b \lVert v^* - c^b \rVert

\log p(v^*; \omega) \equiv \log q^{b^*} - \alpha \left( \log \sigma^{b^*} + \left( \frac{v^* - c^{b^*} - \mu^{b^*}}{\sigma^{b^*}} \right)^2 \right)
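A minimal sketch of this computation follows, assuming fixed bin centers c^b and model-predicted per-bin probabilities q, offsets mu, and scales sigma (the array names and shapes are illustrative, not the paper's implementation).

```python
import numpy as np

def value_log_likelihood(v_star, c, q, mu, sigma, alpha=0.1):
    """Log-likelihood of a scalar value under the binned parametrization: pick the
    bin whose center is closest to v_star, then combine that bin's log-probability
    with the weighted gaussian term around c[b] + mu[b]."""
    b_star = int(np.argmin(np.abs(v_star - c)))        # b* = argmin_b ||v* - c^b||
    z = (v_star - c[b_star] - mu[b_star]) / sigma[b_star]
    return float(np.log(q[b_star]) - alpha * (np.log(sigma[b_star]) + z ** 2))

# Toy usage with B = 4 bins spanning [0, 1].
c = np.array([0.125, 0.375, 0.625, 0.875])             # fixed bin centers c^b
q = np.array([0.1, 0.2, 0.3, 0.4])                     # predicted bin probabilities q^b
mu = np.zeros(4)                                       # predicted within-bin offsets mu^b
sigma = np.full(4, 0.05)                               # predicted scales sigma^b
print(value_log_likelihood(0.7, c, q, mu, sigma))
```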
VAE Training and Inference. We train a variational autoencoder (Kingma & Welling, 2013) on the CIFAR10 dataset with a bottleneck layer of dimension 4 × 4 × 64 i.e. spatial size 4 and feature size 64. We consequently obtain a decoder D which we use for inference given some observed samples S. Specifically, we optimize for a latent variable that minimizes the reconstruction loss for the observed samples (with an additional prior biasing towards the zero vector). Denoting by I(x) the value of image I (bilinearly sampled) at position x, the image I^* inferred using a decoder D by optimizing over S can be computed as: