Tag: database

Create and populate a SQL database using astrodbkit

The astrodbkit package can be used to modify an existing SQL database (such as The BDNYC Database) but it can also be used to create and populate a SQL database from scratch.

To do this, import the BDdb module and create a new database with

from astrodbkit import astrodb
dbpath = '/path/to/new_database.db'
astrodb.create_databse(dbpath)

Then load your new database with

db = astrodb.get_database(dbpath)

and start adding tables! The db.table() method accepts as its arguments the table name, list of field names, and list of data types like so:

db.table('my_new_table', ['field1','field2'], ['INTEGER','TEXT'], new_table=True)

Note new_table=True is necessary to create a new table. Otherwise, it looks for an existing table to modify (which you could do as well!).

To populate your new database with data, read the documentation here or a summary at Adding Data to the BDNYC Database.

As always, I recommend the SQLite Browser for a nice GUI to make changes outside of the command line.

Happy databasing!

Deploying Package Releases to PyPI

So you’ve made some cool new improvements to your Python package and you want to deploy a new release. It’s just a few easy steps.

  1. After your package modules have been updated, you have to update the version number in your __init__.py file. The format is major.minor.micro depending on what you changed. For example, a small bug fix to v0.2.3 would increment to v0.2.4 while a new feature might increment to v0.3.0.
  2. Then make sure all the changes are committed and pushed to Github.
  3. Build with python setup.py sdist
  4. Twine with twine upload dist/compressed_package_filename

That’s it!

Adding Data to the BDNYC Data Archive

To add data to any table in BDNYC.db, there are two easy steps. As our working example, we’ll add some new objects to the SOURCES table.

Step 1: Create an ascii file of the data

Put the data to be added in an ascii file with the following formatting:

  1. The first line must be the delimiter-separated column names to insert/update, e.g. ra|dec|publication_id. Note that the column names in the ascii file need not be in the same order as the table. Also, only the column names that match will be added and non-matching or missing column names will be ignored, e.g. spectral_type|ra|publication_id|dec will ignore the spectral_type values as this is not a column in the SOURCES table and input the other columns in the correct places.
  2. If a record (i.e. a line in your ascii file) has no value for a particular column, type nothing. E.g. for the given column names ra|dec|publication_id|comments, a record with no publication_id should read 34.567|12.834||This object is my favorite!.
  3. Be mindful of your delimiter and be sure not to use it in any of the record values!

Step 2: Add the data to the specified table

To add the data to the table (in our example, the SOURCES table), import BDdb.py and initialize the .db file. Then run the add_data() method with the path to the ascii file as the first argument and the table to add the data to as the second argument. Here’s what that looks like:

from BDNYCdb import BDdb
db = BDdb.get_db('/path/to/the/database/file.db')
db.add_data('/path/to/the/upload/file.csv', 'sources', delimiter='|')

If you want to use a delimiter other than ‘|’, just specify it with the *delimiter argument.

That’s it!

The BDNYC Data Archive

Setting Up the Database

To install, just do:

pip install BDNYCdb

Then download the bdnyc198.db database file. This initial release contains the astrometry, photometry and spectra for the 198 objects in the Filippazzo et al. (2015) sample.

To use the published and unpublished spectra, photometry and astrometry for all 1300 sources, ask someone who has access to the private database to share the BDNYCdb Dropbox folder with you.

Accessing the Database

Create a database instance by launching the Python interpreter and pointing the get_db() function to the database file by doing:

import BDdb
db = BDdb.get_db(filepath)

Voila! You can see an inventory of all data for a specific source by passing a source_id to the inventory() method.

db.inventory(767)

This will also plot all available spectra for that source for visual inspection if you set plot=True.

Querying the Database

Now that you have the database at your fingertips, you’ll want to get some info out of it. To do this, you can pass SQL queries wrapped in double-quotes (“) to the query() method.

data = db.query( "SQL_query_goes_here" )

Here is a detailed post about how to write a SQL query.

The result of a SQL query is a sequence of tuples with the data requested from each record. For example, we can get a source’s photometry from the PHOTOMETRY table with:

db.query("select band,magnitude from photometry where source_id=202")

which gives the output

[('J', 13.526),('H', 12.807),('Ks', 12.503),('W1', 12.486),('W2', 12.386),('W3', 12.313),('W4', 8.525)]

Alternatively, we can have this data returned as a Python dictionary with:

db.query("select band,magnitude from photometry where source_id=202", DICT=True)

which looks like

[{'band': 'J', 'magnitude': 13.526},{'band': 'H', 'magnitude': 12.807},{'band': 'Ks', 'magnitude': 12.503},{'band': 'W1', 'magnitude': 12.486},{'band': 'W2', 'magnitude': 12.386},{'band': 'W3', 'magnitude': 12.313},{'band': 'W4', 'magnitude': 8.525}]

Example Queries

Some SQL query examples to pass to the query() method (wrapped in double-quotes (“) of course) are:

  1. SELECT shortname, ra, dec FROM sources WHERE (222<ra AND ra<232) AND (5<dec AND dec<15)
  2. SELECT band, magnitude, magnitude_unc FROM photometry WHERE source_id=58
  3. SELECT source_id, band, magnitude FROM photometry WHERE band=’z’ AND magnitude<15
  4. SELECT wavelength, flux, unc FROM spectra WHERE observation_id=75”

As you hopefully gathered:

  1. Returns the shortname, ra and dec of all objects in a 10 square degree patch of sky centered at RA = 227, DEC = 10
  2. Returns all the photometry and uncertainties available for object 58
  3. Returns all objects and z magnitudes with z less than 15
  4. Returns the wavelength, flux and uncertainty arrays for all spectra of object 75

The above examples are for querying individual tables only. We can query from multiple tables at the same time with the JOIN command like so:

  1. SELECT t.name, p.band, p.magnitude, p.magnitude_unc FROM telescopes as t JOIN photometry AS p ON p.telescope_id=t.id WHERE p.source_id=58
  2. SELECT p1.magnitude-p2.magnitude FROM photometry AS p1 JOIN photometry AS p2 ON p1.source_id=p2.source_id WHERE p1.band=’J’ AND p2.band=’H’
  3. SELECT src.designation, src.unum, spt.spectral_type FROM sources AS src JOIN spectral_types AS spt ON spt.source_id=src.id WHERE spt.spectral_type>=10 AND spt.spectral_type<20 AND spt.regime=’optical’
  4. SELECT s.unum, p.parallax, p.parallax_unc, p.publication_id FROM sources as s JOIN parallaxes AS p ON p.source_id=s.id

As you may have gathered:

  1. Returns the survey, band and magnitude for all photometry of source 58
  2. Returns the J-H color for every object
  3. Returns the designation, U-number and optical spectral type for all L dwarfs
  4. Returns the parallax measurements and publications for all sources

Database Schema and Browsing

In order to write the SQL queries above you of course need to know the names of the fields in each table are. One way to do this is:

db.query("PRAGMA table_info('PHOTOMETRY')")

SQL browserEven easier is to use the DB Browser for SQLite pictured at left which lets you expand and collapse each table, sort and order columns, and other fun stuff.

It even allows you to manually create/edit/destroy records with a very nice GUI.

IMPORTANT: Keep in mind that if you change a database record, you immediately change it for everyone since we share the same database file on Dropbox. Be careful!

Always check and double-check that you are entering the correct data for the correct source before you save any changes with the SQLite Database Browser.

SQL Queries

An SQL database is comprised of a bunch of tables (kind of like a spreadsheet) that have fields (column names) and records (rows of data). For example, our database might have a table called students that looks like this:

id first last grade GPA
1 Al Einstein 6 2.7
2 Annie Cannon 6 3.8
3 Artie Eddington 8 3.2
4 Carlie Herschel 8 3.2

So in our students table, the fields are [id, first, last, grade, GPA], and there are a total of four records, each with a required yet arbitrary id in the first column.

To pull these records out, we tell SQL to SELECT values for the following fields FROM a certain table. In SQL this looks like:

In [1]: db.execute("SELECT id, first, last, grade, GPA FROM students").fetchall()
Out[1]: [(1,'Al','Einstein',6,2.7),(2,'Annie','Cannon',6,3.8),(3,'Artie','Eddington',8,3.2),(4,'Carlie','Herschel',8,3.2)]

Or equivalently, we can just use a wildcard “*” if we want to return all fields with the SQL query "SELECT * FROM students".

We can modify our SQL query to change the order of fields or only return certain ones as well. For example:

In [2]: db.execute("SELECT last, first, GPA FROM students").fetchall()
Out[1]: [('Einstein','Al',2.7),('Cannon','Annie',3.8),('Eddington','Artie',3.2),('Herschel','Carlie',3.2)]

Now that we know how to get records from tables, we can restrict which records it returns with the WHERE statement:

In [3]: db.execute("SELECT last, first, GPA FROM students WHERE GPA>3.1").fetchall()
Out[3]: [('Cannon','Annie',3.8),('Eddington','Artie',3.2),('Herschel','Carlie',3.2)]

Notice the first student had a GPA less than 3.1 so he was omitted from the result.

Now let’s say we have a second table called quizzes which is a table of every quiz grade for all students that looks like this:

id student_id quiz_number score
1 1 3 89
2 2 3 96
3 3 3 94
4 4 3 92
5 1 4 78
6 3 4 88
7 4 4 91

Now if we want to see only Al’s grades, we have to JOIN the tables ON some condition. In this case, we want to tell SQL that the student_id (not the id) in the quizzes table should match the id in the students table (since only those grades are Al’s). This looks like:

In [4]: db.execute("SELECT quizzes.quiz_number, quizzes.score FROM quizzes JOIN students ON students.id=quizzes.student_id WHERE students.last='Einstein'").fetchall()
Out[4]: [(3,89),(4,78)]

So students.id=quizzes.student_id associates each quiz with a student from the students table and students.last='Einstein' specifies that we only want the grades from the student with last name Einstein.

Similarly, we can see who scored 90 or greater on which quiz with:

In [5]: db.execute("SELECT students.last, quizzes.quiz_number, quizzes.score FROM quizzes JOIN students ON students.id=quizzes.student_id WHERE quizzes.score>=90").fetchall()
Out[5]: [('Cannon',3,96),('Eddington',3,94),('Herschel',3,92),('Herschel',4,91)]

That’s it! We can JOIN as many tables as we want with as many restrictions we need to pull out data in the desired form.

This is powerful, but the queries can become lengthy. A slight shortcut is to use the AS statement to assign a table to a variable (e.g. students => s, quizzes => q) like such:

In [6]: db.execute("SELECT s.last, q.quiz_number, q.score FROM quizzes AS q JOIN students AS s ON s.id=q.student_id WHERE q.score>=90").fetchall()
Out[6]: [('Cannon',3,96),('Eddington',3,94),('Herschel',3,92),('Herschel',4,91)]