From @chrishiestand on September 29, 2017 4:30
The gist is that BigQuery sometimes silently drops one or more streams when many streams are used in parallel (42 in this example). If I'm hitting a quota, an error ought to be thrown, but no errors are thrown: the program runs as expected, but in the end there are rows missing from the BigQuery data.
If there is a bug, it might be in gcloud-node, or it might be in the BigQuery API. Both seem less likely than me having made a mistake, so I hope you can find something I've done wrong.
The tricky part is that the bug doesn't always reproduce. When it does reproduce, n streams of size s are dropped, so BigQuery will be missing n * s rows. So if n=2 and s=150, BigQuery will be missing 300 rows. In other words, the problem does not appear to be that a subset of a stream's data is missing, but rather that one or more entire streams are missing.
This seems to reproduce reliably sometimes, and other times it reliably does not reproduce. To try and get the opposite result, try again later and/or change the stream load with the env variables.
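To make the n * s arithmetic concrete, here is a minimal sketch of how a row count could be checked for the "whole streams missing" pattern. The function name `diagnoseMissingRows` and its parameters are hypothetical and are not taken from the reproduction project; it assumes every stream carries the same number of rows.

```javascript
// Hypothetical helper, not part of the reproduction project.
// Assumes every stream writes exactly `rowsPerStream` rows (rowsPerStream > 0).
function diagnoseMissingRows(streamCount, rowsPerStream, actualRows) {
  const expectedRows = streamCount * rowsPerStream;
  const missingRows = expectedRows - actualRows;
  // If the shortfall is an exact multiple of the stream size, the loss is
  // consistent with entire streams being dropped rather than partial data.
  const wholeStreams = missingRows >= 0 && missingRows % rowsPerStream === 0;
  return {
    expectedRows,
    missingRows,
    wholeStreams,
    missingStreams: wholeStreams ? missingRows / rowsPerStream : null,
  };
}

// 42 streams of 150 rows each, but only 6000 rows counted in the table.
console.log(diagnoseMissingRows(42, 150, 6000));
```

With the numbers from the example above (s=150 and 300 rows missing), this reports two whole streams missing. A shortfall that is not a multiple of s would instead point to partial stream loss, which is not what I'm seeing.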
This small bug reproduction project, https://github.com/chrishiestand/gcloud-node-bigquery-manystreams-bug, is the result of troubleshooting missing BigQuery data in a production system where 50 streams are processed in parallel.
Below is a screenshot of the reproduction repository showing a reproduction of the issue.

In contrast, if I reduce the number of streams from 42 to 10, the tests pass as below:

Environment details
- OS: OS X 10.12.6
- Node.js version: 8.6.0
- npm version: 5.3.0
- google-cloud-node version: @google-cloud/bigquery = 0.9.6
Steps to reproduce
Please go here for a reproduction project: https://github.com/chrishiestand/gcloud-node-bigquery-manystreams-bug
Copied from original issue: googleapis/google-cloud-node#2635