Skip to content

Metrics server /api-metrics endpoint fails with 500 Internal Server Error without helpfull diagnostic #2332

@gberche-orange

Description

@gberche-orange

Describe the bug

The /api-metrics endpoint fails with the following response

$ curl -vvv --key metrics_server_key.crt --cacert metrics_server_ca.crt --cert metrics_server_cert.crt https://bosh-master.internal.paas:9091/api_metrics
< HTTP/1.1 500 Internal Server Error                                                                  
< Server: nginx                                                                                                                                                                                             
< Date: Thu, 28 Oct 2021 13:53:40 GMT                                                                                                                                                                       
< Content-Length: 5071                                                                                
< Connection: keep-alive                                                                              
<                                                                                                     
Puma caught this error: undefined method `unpack' for nil:NilClass (NoMethodError)
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:238:in `block (2 levels) in all_values'                                                                                         
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:236:in `each'             
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:236:in `map'
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:236:in `block in all_values'               
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:271:in `with_file_lock'   
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:235:in `all_values'       
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:124:in `block in all_values'                                              
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:121:in `each'             
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:121:in `all_values'       
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/histogram.rb:84:in `values'                                
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/formats/text.rb:30:in `block in marshal'                   
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/formats/text.rb:26:in `each'                               
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/formats/text.rb:26:in `marshal'                            
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/middleware/exporter.rb:69:in `respond_with'                       
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/middleware/exporter.rb:30:in `call'                               
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/middleware/collector.rb:32:in `block in call'                     
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/middleware/collector.rb:61:in `block in trace'                    
/var/vcap/data/packages/ruby-2.6.5-r0.29.0/b20c017a80a93122d1c536bbd2cfe28f2f6aef97/lib/ruby/2.6.0/benchmark.rb:308:in `realtime'                                                                           
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/middleware/collector.rb:61:in `trace'                             
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/middleware/collector.rb:32:in `call'                              
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/rack-2.2.3/lib/rack/common_logger.rb:38:in `call'                                                        
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/sinatra-2.0.8.1/lib/sinatra/base.rb:231:in `call'                                                        
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/puma-3.12.6/lib/puma/configuration.rb:227:in `call'                                                      
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/puma-3.12.6/lib/puma/server.rb:706:in `handle_request'                                                   
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/puma-3.12.6/lib/puma/server.rb:476:in `process_client'                                                   
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/puma-3.12.6/lib/puma/server.rb:334:in `block in run'                                                     
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/puma-3.12.6/lib/puma/thread_pool.rb:135:in `block in spawn_thread'                                       

This is not reproducing on another environment, so I'm suspecting this is related to a problem in this lab environment, but this is hard to diagnose.

To Reproduce

Enabled metrics server as suggested in https://bosh.io/docs/director-metrics/

Expected behavior

  • Helpful diagnostic message about invalid output being parsed
  • Hints on how to troubleshoot the issue

Logs

the metrics-server logs are not showing errors

==> metrics_server.stderr.log <==
127.0.0.1 - - [28/Oct/2021:13:58:06 +0000] "GET /metrics HTTP/1.1" 200 - 0.0062

==> metrics_server.stdout.log <==
I, [2021-10-28T13:58:22.017392 #11] []  INFO -- MetricsServer: populating metrics
D, [2021-10-28T13:58:22.019427 #11] [] DEBUG -- Director: (0.000547s) (conn: 47272601494620) SELECT * FROM "director_attributes" WHERE ("name" = 'resurrection_paused') LIMIT 1
D, [2021-10-28T13:58:22.028558 #11] [] DEBUG -- Director: (0.008083s) (conn: 47272601494620) SELECT DISTINCT "type", count(*) AS "count" FROM "tasks" GROUP BY "type"
D, [2021-10-28T13:58:22.030640 #11] [] DEBUG -- Director: (0.000506s) (conn: 47272601494620) SELECT "type", "state", count(*) AS "count" FROM "tasks" WHERE ("state" IN ('processing', 'queued')) GROUP BY "type", "state"
D, [2021-10-28T13:58:22.034120 #11] [] DEBUG -- Director: (0.000447s) (conn: 47272601494620) SELECT * FROM "configs" WHERE ("id" IN (SELECT max("id") FROM "configs" WHERE ("type" = 'cloud') GROUP BY "name"))
D, [2021-10-28T13:58:22.080868 #11] [] DEBUG -- Director: (0.000284s) (conn: 47272601494620) SELECT count(*) AS "count" FROM "ip_addresses" WHERE (("network_name" = 'redacted') AND ("static" IS FALSE)) LIMIT 1
[...]
I, [2021-10-28T13:58:22.091951 #11] []  INFO -- MetricsServer: populated metrics

The director stderr logs include the same stack trace

/var/vcap/sys/log/director# less director.stderr.log
2021-10-28 15:06:54 +0000: Rack app error handling request { GET /metrics }
#<NoMethodError: undefined method `unpack' for nil:NilClass>
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:238:in `block (2 levels) in all_values'                                                                                         
/var/vcap/data/packages/director/4aade3c707199631e6d710c7f75a12f332b14bf3/gem_home/ruby/2.6.0/gems/prometheus-client-1.0.0/lib/prometheus/client/data_stores/direct_file_store.rb:236:in `each'             
[...]

Versions (please complete the following information):

bosh 271.8.0*
running on vsphere

Deployment info:
If possible, share your (redacted) manifest and any ops files used to deploy
BOSH or any other releases on top of BOSH.

If you used any deployment strategy it'd be helpful to point it out and share as
much about it as possible (e.g. bosh-deployment, PCF, genesis, spiff, etc)

Additional context

The /var/vcap/store/director/metrics directory has associated metric_http_server_exceptions_total___25.bin file with NoMethodError content no much helping

bosh/c50e9533-15cf-447f-80c0-53c683c5a3cb:/var/vcap/store/director/metrics# ls -alrt /var/vcap/store/director/metrics | tail
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:26 metric_http_server_request_duration_seconds___21.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:29 metric_http_server_exceptions_total___37.bin
drwxr-xr-x 2 vcap vcap    4096 Oct 28 14:29 .
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:31 metric_http_server_exceptions_total___25.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:33 metric_http_server_requests_total___37.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:33 metric_http_server_request_duration_seconds___37.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:33 metric_http_server_requests_total___23.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:33 metric_http_server_request_duration_seconds___23.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:33 metric_http_server_requests_total___25.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:33 metric_http_server_request_duration_seconds___25.bin


bosh/c50e9533-15cf-447f-80c0-53c683c5a3cb:/var/vcap/store/director/metrics# cat metric_http_server_exceptions_total___22.bin
0exception=NoMethodError     k@

The problem is'nt new on this environment.

bosh/c50e9533-15cf-447f-80c0-53c683c5a3cb:/var/vcap/store/director/metrics# grep NoMethodError *
Binary file metric_http_server_exceptions_total___20.bin matches
Binary file metric_http_server_exceptions_total___21.bin matches
Binary file metric_http_server_exceptions_total___22.bin matches
Binary file metric_http_server_exceptions_total___23.bin matches
Binary file metric_http_server_exceptions_total___24.bin matches
Binary file metric_http_server_exceptions_total___25.bin matches
Binary file metric_http_server_exceptions_total___28.bin matches
Binary file metric_http_server_exceptions_total___29.bin matches
Binary file metric_http_server_exceptions_total___31.bin matches
Binary file metric_http_server_exceptions_total___35.bin matches
Binary file metric_http_server_exceptions_total___37.bin matches

bosh/c50e9533-15cf-447f-80c0-53c683c5a3cb:/var/vcap/store/director/metrics# ls -alrt *exceptions*
-rw-r--r-- 1 vcap vcap 1048576 May  7 10:25 metric_http_server_exceptions_total___31.bin
-rw-r--r-- 1 vcap vcap 1048576 May  9 19:50 metric_http_server_exceptions_total___22.bin
-rw-r--r-- 1 vcap vcap 1048576 May 11 14:50 metric_http_server_exceptions_total___20.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct  6 16:10 metric_http_server_exceptions_total___28.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct  6 16:20 metric_http_server_exceptions_total___29.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 20 08:30 metric_http_server_exceptions_total___24.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:06 metric_http_server_exceptions_total___35.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:22 metric_http_server_exceptions_total___21.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:29 metric_http_server_exceptions_total___37.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:31 metric_http_server_exceptions_total___25.bin
-rw-r--r-- 1 vcap vcap 1048576 Oct 28 14:36 metric_http_server_exceptions_total___23.bin

How does the api-metrics endpoint collects its metrics ? Any way to inspect it to diagnose the root cause ?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions