BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. #10174

### Description of the problem / feature request:

Multi-byte UTF-8 sequences in BUILD files get parsed as multiple distinct characters, one per byte, instead of a single code point.

### Feature requests: what underlying problem are you trying to solve with this feature?

It is useful to have full Unicode support for attributes.

### Bugs: what's the simplest, easiest way to reproduce this bug?

- Define a rule with one string attribute.
- In the rule implementation, call ctx.actions.write(..., content={that attribute}).
- Instantiate the rule with an attribute value that contains a non-ASCII character, such as the copyright symbol (©), and inspect the output file.

Repo here: https://github.com/tonyaiuto/bazel/tree/master/utf8_encode

### What operating system are you running Bazel on?
Linux

### What's the output of `bazel info release`?
bazel 1.1.0

### Have you found anything relevant by searching the web?

#4551 is similar. 


### Analysis from the sample repo.

In ISO 8859-1 (Latin-1) the copyright symbol is 0xA9, which is also its
Unicode code point (U+00A9). The UTF-8 encoding of that code point is the
two bytes [c2, a9]. What we see in the output file is those two bytes, each
encoded again as UTF-8.

The BUILD file is in UTF-8 format:

```
grep copyright_notice BUILD | od -c -t x1
0000000                   c   o   p   y   r   i   g   h   t   _   n   o
         20  20  20  20  63  6f  70  79  72  69  67  68  74  5f  6e  6f
0000020   t   i   c   e       =       "   C   o   p   y   r   i   g   h
         74  69  63  65  20  3d  20  22  43  6f  70  79  72  69  67  68
0000040   t     302 251       2   0   1   9       T   o   n   y       A
         74  20  c2  a9  20  32  30  31  39  20  54  6f  6e  79  20  41
0000060   i   u   t   o   "   ,  \n
         69  75  74  6f  22  2c  0a
0000067
```

The UTF-8 encoded copyright symbol shows up as the octal bytes 302 251
(hex c2 a9).

It would seem that:
-   the BUILD file is parsed as a stream of octets, each one becoming a
    distinct character [0xc2, 0xa9].
-   write() presumes the output should be UTF-8 encoded and converts the 2
    characters into the 4 bytes needed for their UTF-8 representation.
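The suspected sequence can be reproduced outside Bazel. The following is a minimal Python sketch of the hypothesis above, not of Bazel's actual code: it assumes the parser maps each octet to one character (equivalent to a Latin-1 decode) and that write() encodes the resulting string as UTF-8.

```python
# Bytes of the copyright symbol as stored in the UTF-8 encoded BUILD file.
source_bytes = "©".encode("utf-8")

# Assumed parser behavior: every octet becomes its own character.
parsed = source_bytes.decode("latin-1")

# Assumed write() behavior: encode those characters as UTF-8.
written = parsed.encode("utf-8")

print(source_bytes.hex())  # c2a9
print(len(parsed))         # 2
print(written.hex())       # c382c2a9
```

The two input bytes become two characters (U+00C2 and U+00A9), and each of those needs two bytes in UTF-8, which accounts for the four bytes in the output.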

## Potential fixes

### BUILD files are UTF-8

-   Pro: Easy to implement and understand.
-   Con: A breaking change, though probably few users are affected.
    (Reasoning: non-ASCII content is already broken, so it is unlikely that
    anyone is using anything beyond ASCII successfully.)

### Allow BUILD files to specify an encoding

We could borrow from [PEP-263](https://www.python.org/dev/peps/pep-0263/)
and use:

```
# -*- coding: <encoding name> -*-
```

-   Pro: Not a breaking change.
-   Con: Feature bloat.
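To give a sense of the size of this option, here is a rough Python sketch of PEP 263 style detection. The regular expression is the one given in PEP 263; the function name `detect_encoding`, the UTF-8 default, and the sample BUILD content are assumptions made for illustration, not existing Bazel behavior.

```python
import re

# Pattern from PEP 263; the declaration must be on the first or second line.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding declared in the file, or `default` if none is found."""
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


# Hypothetical BUILD file stored as Latin-1, with a matching declaration.
build_file = b'# -*- coding: latin-1 -*-\ncopyright_notice = "Copyright \xa9 2019"\n'
text = build_file.decode(detect_encoding(build_file))
print(text.splitlines()[1])
```

A file without a declaration falls back to the default, so this approach would still require choosing what that default is.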


### Bytes are bytes

ctx.actions.write() should not assume an encoding for the output. It
would emit exactly what it is given.

-   Pro: Easy to implement.
-   Con: Breaking change.
-   Con: Hard to defend as the most useful option.
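As a rough illustration of what "emit exactly what it is given" would mean, the Python sketch below assumes the parser keeps producing one character per octet and that write() then emits one byte per character. Both behaviors are assumptions about a possible design, not a description of Bazel's implementation.

```python
# Bytes of the copyright symbol as stored in the UTF-8 encoded BUILD file.
source_bytes = "©".encode("utf-8")

# Assumed parser behavior: one character per octet.
parsed = source_bytes.decode("latin-1")

# Assumed write() behavior under this option: one byte per character.
written = parsed.encode("latin-1")

# The output is byte-for-byte identical to the input.
print(written == source_bytes)  # True
```

In this sketch the round trip is symmetric only while every character fits in one byte; in Python, `encode("latin-1")` raises `UnicodeEncodeError` for code points above U+00FF.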


### More switches and bells.

ctx.actions.write() could have an encoding attribute. Use 'none' for this case.

-   Con: Feature bloat.
-   Con: Dubious value. 
-   Con: Still does not fix the bug.
