This page is an IPython notebook. If you have the actual notebook file (not a PDF or HTML file generated from it), and you have a version of IPython installed that uses the same Python as the LSST stack, you can run all of the Python commands on this page. You'll need to change the DATA_DIR variable in the first line to something appropriate for you, and run hscProcessCcd.py on at least one CCD first (and if it's not the same CCD I used here, you'll have to change that variable as well).
The Butler is the object that manages all of the pipeline's inputs and outputs. It manages the directory structure, so you don't have to - while you can read the files directly, it's always better to use the Butler if you can. To use it, you'll need that basic directory structure to be set up already, including a couple of special files. In this tutorial, we'll assume that's already been done.
import lsst.daf.persistence
DATA_DIR = "/home/jbosch/HSC/tutorial/data"
butler = lsst.daf.persistence.Butler(DATA_DIR)
Using the butler to get a dataset is just a matter of knowing the name of the dataset and its "data ID", a dictionary that describes the particular unit of data you want. The dataset name for a CCD-level calibrated image is "calexp", and for a data ID, we provide the visit and CCD numbers:
exposure = butler.get("calexp", visit=1238, ccd=40, immediate=True)
We could also pass the data ID as a dict, which is useful if we want to reuse it:
dataId = {"visit": 1238, "ccd": 40}
exposure = butler.get("calexp", dataId, immediate=True)
The immediate=True argument isn't strictly necessary, but it can help make bugs in later code easier to track down. Without it, instead of returning the thing you want, the Butler returns a proxy object that mimics the object you want, and doesn't actually load the dataset until you try to use it. Most of the time, that just causes confusion, so it's best to just use immediate=True to disable the proxy behavior and have the Butler load and return the dataset immediately.
The Python exposure object we get from the butler in this case is an instance of lsst.afw.image.ExposureF, which is a combination of several things, including:

- getMaskedImage() returns an lsst.afw.image.MaskedImageF, which is itself a collection of three images:
  - getImage() returns an lsst.afw.image.ImageF, containing the science image
  - getMask() returns an lsst.afw.image.MaskU, containing bad pixel masks
  - getVariance() returns an lsst.afw.image.ImageF, containing the per-pixel variance
- getPsf() returns an lsst.afw.detection.Psf, our model for the point-spread function
- getWcs() returns an lsst.afw.image.Wcs, with the image's astrometric registration
- getCalib() returns an lsst.afw.image.Calib, with the image's photometric calibration
- getMetadata() returns an lsst.daf.base.PropertySet, with miscellaneous other metadata
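For example, here's how you might pull a few of those components out of the exposure we loaded above (just a sketch; the getArray() call, which should give a NumPy view of the pixels, is an assumption beyond the accessors listed above):
maskedImage = exposure.getMaskedImage()
image = maskedImage.getImage()         # science image (lsst.afw.image.ImageF)
mask = maskedImage.getMask()           # bad pixel masks (lsst.afw.image.MaskU)
variance = maskedImage.getVariance()   # per-pixel variance (lsst.afw.image.ImageF)
psf = exposure.getPsf()                # PSF model
wcs = exposure.getWcs()                # astrometric registration
print image.getArray().shape           # assumed: getArray() returns a NumPy view of the pixels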
An Exposure is saved as a multi-extension FITS file. The first three HDUs are just the three image planes (image, mask, and variance).
A visit+ccd data ID is also used by the main catalog of single frame measurements, which is the src dataset:
catalog = butler.get("src", dataId, immediate=True)
The catalog object here is an instance of lsst.afw.table.SourceCatalog. We'll return to it later in much greater detail.
The table below isn't an exhaustive list of all data products, but it has the ones you'll use most of the time. You can look in obs_subaru/policy/HscMapper.paf for the full list, but beware that there are a lot of old entries there that are no longer used. While we won't cover most of these in the tutorial, the Name and Data ID Keys columns give you everything you need to know to get these from the Butler, and you'll note that most of them are either lsst.afw.image.ExposureF or lsst.afw.table.SourceCatalog, which means most of the tutorials below will apply to them indirectly.
Name | Data ID Keys | Type | Produced By | Description |
---|---|---|---|---|
calexp | visit, ccd | lsst.afw.image.ExposureF | processCcd.py/reduceFrames.py | background-subtracted exposure, with Wcs, Psf, Calib, etc. |
calexpBackground | visit, ccd | lsst.afw.math.BackgroundList | processCcd.py/reduceFrames.py | background subtracted from calexp |
src | visit, ccd | lsst.afw.table.SourceCatalog | processCcd.py/reduceFrames.py | main single-frame catalog |
icSrc | visit, ccd | lsst.afw.table.SourceCatalog | processCcd.py/reduceFrames.py | bright sources used for initial single-frame calibration |
forced_src | visit, ccd, tract | lsst.afw.table.SourceCatalog | forcedPhotCcd.py | forced photometry, using coadd positions and shapes |
calibrated_src | visit, ccd, tract | lsst.afw.table.SourceCatalog | calibrateCatalog.py | über-calibrated src |
calibrated_exp | visit, ccd, tract | lsst.afw.image.ExposureF | calibrateExposure.py | über-calibrated calexp |
deepCoadd_calexp | tract, patch, filter | lsst.afw.image.ExposureF | processCoadd.py/stack.py | background-subtracted coadd, with Wcs, Psf, Calib, ApCorr, etc. |
deepCoadd_calexpBackground | tract, patch, filter | lsst.afw.math.BackgroundList | processCoadd.py/stack.py | background subtracted from deepCoadd_calexp |
deepCoadd_src | tract, patch, filter | lsst.afw.table.SourceCatalog | processCoadd.py/stack.py | main coadd catalog, done separately in each band |
deepCoadd_forced_src | tract, patch, filter | lsst.afw.table.SourceCatalog | forcedPhotCoadd.py | forced photometry, using one band as reference for centroids, shapes |
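For example, the coadd-level datasets in the table take a tract/patch/filter data ID instead of visit/ccd. Here's a minimal sketch (the tract, patch, and filter values are placeholders; substitute ones that actually exist in your data repository):
coaddDataId = {"tract": 0, "patch": "1,1", "filter": "HSC-I"}   # placeholder values
coadd = butler.get("deepCoadd_calexp", coaddDataId, immediate=True)
coaddCatalog = butler.get("deepCoadd_src", coaddDataId, immediate=True)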
The most important thing you can do to understand what's in a catalog is to look at its schema:
print catalog.schema
A catalog's schema is essentially a list of its columns (called Fields). In addition to its name and type, each field also typically has some documentation and units (if appropriate). Most of the time, you'll want to access the data column-by-column, which you can do by just calling catalog.get() with the name of the field. That returns a NumPy array, so you can use it directly for plotting. For instance, here's a plot of x position on the chip vs. PSF flux:
from matplotlib import pyplot
%matplotlib inline
pyplot.scatter(catalog.get("centroid.sdss.x"), catalog.get("flux.psf"), alpha=0.2)
pyplot.ylim(0, 5000)
If you're paying very close attention, you may have noticed that while "centroid.sdss" was in the schema we printed, "centroid.sdss.x" was not. That's because "centroid.sdss" has type "PointD", which is a compound field. You can't actually get columns for an entire compound field, because there's no NumPy type for them, but you can get their implicit subfields, which for points are just "x" and "y". You can find information on the field types in the reference documentation: http://hsca.ipmu.jp/doxygen/latest/afw_table.html.
If we try to make a histogram of the fluxes, however, we'll run into problems:
pyplot.hist(catalog.get("flux.psf"), bins=50)
That's because many of the PSF flux measurements failed, and hence the array we get back has a lot of NaNs, and matplotlib's histogram doesn't like NaNs. To filter those out, we need to filter the catalog, and to do that, we want to look at Flag fields. Flag fields look just like booleans - when you get a column of them, they just return a bool array - but within the catalog each Flag value only takes up a single bit, so you don't need to worry about the fact that we have a lot of them. Most measurement algorithms have several flags, to indicate failure modes. Let's look at the flags for the shape.sdss algorithm:
for k, v in catalog.schema.extract("shape.sdss.flags*").iteritems():
    print "%s: %s" % (k, v.field.getDoc())
The extract method is a way to get a view of just a small piece of a Schema, using a glob pattern (it can also use regular expressions). It returns a dict of field names to SchemaItems, which are just structs that hold a Field and a Key. The Field contains all the descriptive information, like the getDoc() method we use here. We'll return to Key objects later.
You can also call extract on a catalog, to get a dict of names and column arrays:
catalog.extract("flux.psf*")
You may have noticed that both shape.sdss and flux.psf have one or more fields starting with *.flags, with one Flag field named just <algorithm>.flags. That Flag field is essentially a combination of all the ones below it that represent a particular problem with the outputs (not all flags do), and an algorithm should never return NaN unless it also sets that "general failure" flag field. (Sometimes an algorithm will still produce a result, even though it does set the general failure flag, but this usually means the result shouldn't be trusted.)
To get back to the problem of making a histogram of PSF fluxes, we now know we can use the "flux.psf.flags" field as a filter. To apply that filter, we simply index the catalog with a bool array that's True for all the objects we want to keep (just as we would index a NumPy array). Because the Flag field is actually False for all the objects we want to keep, we need to invert it first:
import numpy
good = catalog[numpy.logical_not(catalog.get("flux.psf.flags"))]
pyplot.hist(good.get("flux.psf"), bins=50)
The reason things didn't work this time is that when we slice the catalog, the new catalog we get back is a view into the old one. And that means that the rows (i.e. records) of the catalog are not contiguous in memory. The columns we get back as NumPy arrays are also views into the catalog, and NumPy's data model requires that those be contiguous. So to get columns from a catalog, we need that catalog to be contiguous. Happily, that's easy - we just deep-copy that catalog after we slice it:
good = catalog[numpy.logical_not(catalog.get("flux.psf.flags"))].copy(deep=True)
pyplot.hist(good.get("flux.psf"), bins=10)
We can also iterate over catalogs row-by-row instead of column-by-column. Here's a slower way of getting an array of unflagged PSF fluxes:
fluxes = []
for record in catalog:
    if not record.get("flux.psf.flags"):
        fluxes.append(record.get("flux.psf"))
This is slow for two reasons:

- When we get columns, we loop over all the rows in C++, which is much faster than looping over rows in Python.
- Looking up a field by name is actually quite slow, and while we only do that once for all rows when we get an entire column, in the code above we do it separately for every row.
Sometimes you need to loop over every row in Python anyway, in order to perform some computation. When you do that, you can avoid the second problem by using Key objects:
flagKey = catalog.schema.find("flux.psf.flags").key
fluxKey = catalog.schema.find("flux.psf").key
fluxes = []
for record in catalog:
    if not record.get(flagKey):
        fluxes.append(record.get(fluxKey))
Using a Key object to get a value from a record is much faster than using a string, and you can use a Key anywhere you could use a string in field lookup (including columns, though usually it's no faster there, because you only use the Key once).
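For instance, the keys we just made can be reused for column access (a quick sketch; this gives the same arrays as the string lookups above):
fluxColumn = catalog.get(fluxKey)    # same as catalog.get("flux.psf")
flagColumn = catalog.get(flagKey)    # same as catalog.get("flux.psf.flags")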
You can also get individual records that correspond to a particular source by their position in the catalog:
record = catalog[52]
Or by searching by Source ID (note that the ID is not the same as the position in the catalog!):
record = catalog.find(1063605701181802)
One thing you can do with records that you can't do with full catalogs is get compound fields. For instance, now we can get the full "centroid.sdss" field, as an lsst.afw.geom.Point2D:
point = record.get("centroid.sdss")
print point.getX(), point.getY()
Instead of using strings or keys, you can also get many fields using a system called "slots". Not only are these easy to use, they also mean you do not have to know the name of the algorithm that was used to produce a measurement in order to get its result.
For example, we can also get the centroid like this:
point = record.getCentroid()
This is the same point we had before, because getCentroid is just an alias to "centroid.sdss". When we run the pipeline, we can configure which of several centroid algorithms is used by getCentroid. Because it returns a complex object, you cannot use getCentroid on catalogs, but you can use getX() and getY() on both catalogs and records to get the coordinates of getCentroid() separately:
assert record.getX() == record.get("centroid.sdss.x")
assert (catalog.getX() == catalog.get("centroid.sdss.x")).all()
In addition to the centroid, we have the following slots:

- getShape() (also getIxx(), getIyy(), getIxy()) - moments of the object, usually an alias to "shape.sdss".
- getPsfFlux() - a flux using a fit of the PSF model, usually an alias to "flux.psf".
- getApFlux() - an aperture flux at a particular radius considered useful for the pipeline
- getModelFlux() - a flux using some sort of model fit. In the coadd processing, this is cmodel.flux by default, but it is flux.gaussian by default on single frames, because the CModel flux is very slow.
- getInstFlux() - means "instrumental flux", but it mostly exists for historical reasons
All of these slots also have uncertainty and flag getters (e.g. getPsfFluxErr() -> "flux.psf.err", getPsfFluxFlag() -> "flux.psf.flags").
When you can use one of these slots, it is probably better than using a string name, because it will keep your code from breaking if we change the names of algorithms in the future (it is much less likely that the slots will change), or decide to use a different algorithm in one of the slots (for instance, centroid.sdss is our best centroid measurement right now, but we may have a better one in the future).
Because columns are views into the catalog, not copies, if you modify them using the usual NumPy commands, you modify the catalog itself:
psfFlux = catalog.getPsfFlux()
psfFlux[2:4] = 0.0
An important exception is Flag fields: because these are stored as individual bits, getting a Flag column does make a copy, so modifying them does not modify the catalog. To modify a Flag, you have to use a record object.
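As a quick demonstration of that copy behavior (a sketch; nothing here writes back to the catalog):
flags = catalog.get("flux.psf.flags")              # this is a copy, not a view
flags[0] = True                                    # changes only the copy
print flags[0], catalog[0].get("flux.psf.flags")   # the catalog itself is unchanged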
To modify records, you use the set method:
record.set("flux.psf.flags", True)
If you want to add a record to a catalog, use addNew():
newRecord = catalog.addNew()
You can then modify that record using set, but be warned: adding a record can break the "contiguousness" of the catalog's memory, making it impossible to get columns from the catalog (which you can fix by making a deep copy, as before). It won't always break the contiguousness - it depends on whether the catalog needs to allocate more memory to make space for the new record. There are many options to control the memory management of a catalog object, but that is beyond the scope of this tutorial.
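If you're not sure whether a catalog is still contiguous, you can check (assuming the isContiguous() method is available in your version of the stack):
print catalog.isContiguous()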
To avoid doing that now (and breaking the examples below), we'll delete the record we just created:
del catalog[-1]
You can also add a single existing record to a catalog with catalog.append, or combine multiple catalogs together with catalog.extend. We won't cover those further today, except to mention that all records in a catalog must have the same Schema, so you must make sure this is the case before adding more records to a catalog.
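Here's a minimal sketch of combining catalogs that share a Schema (the slice bounds are arbitrary, and the deep=True argument to extend is an assumption about how to deep-copy the added records):
subset = catalog[:10].copy(deep=True)       # a new catalog with the same Schema
subset.append(catalog[20])                  # add one existing record
subset.extend(catalog[30:40], deep=True)    # add several at once, copying them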
Adding columns to a catalog is much more difficult. In fact, the memory model of the catalog makes this impossible - instead, you have to create a new catalog with both the new columns and the old columns in the Schema, and then copy the columns from the old catalog to the new catalog. There is an object, lsst.afw.table.SchemaMapper, to help with this, and some other documentation exists to show how to use it for this purpose: http://hsca.ipmu.jp:8080/question/289/how-do-i-add-columns-to-an-existing-catalog/.
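Here's a rough sketch of that pattern (the SchemaMapper and Field calls below are assumptions that may differ between stack versions, so treat this as an outline and see the linked page for the authoritative recipe):
import lsst.afw.table
mapper = lsst.afw.table.SchemaMapper(catalog.schema)
for item in catalog.schema.extract("*").itervalues():
    mapper.addMapping(item.key)                       # carry every existing field through unchanged
newKey = mapper.addOutputField(lsst.afw.table.Field["D"]("my.new.column", "an extra column"))   # assumed API
newCatalog = lsst.afw.table.SourceCatalog(mapper.getOutputSchema())
newCatalog.extend(catalog, mapper=mapper)             # copy the mapped columns from the old catalog
for record in newCatalog:
    record.set(newKey, 0.0)                           # fill the new column record-by-record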
Usually when looking at flux measurements, we actually want to plot magnitudes instead of raw fluxes. To convert between them, we use a Calib object, which we can get from the exposure object we loaded back at the beginning:
calib = exposure.getCalib()
calib.setThrowOnNegativeFlux(False) # don't raise an exception when we encounter a negative or NaN flux
psfMag = calib.getMagnitude(catalog.getPsfFlux())
A Calib can also convert flux errors to magnitude errors at the same time:
psfMag, psfMagErr = calib.getMagnitude(catalog.getPsfFlux(), catalog.getPsfFluxErr())
Note that you can also pass single values (e.g. a flux from a record) to calib.getMagnitude(...), in which case it will of course return single values instead of arrays.
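For example, with the record we retrieved earlier:
print calib.getMagnitude(record.getPsfFlux())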
If you want to transform (x, y) point measurements to (ra, dec) measurements, you probably don't need to do anything - for convenience, records and catalogs already have getRA() and getDec() methods, and records have a getCoord() method as well. But let's imagine that the Wcs object on our exposure is better than it was during the measurement, and we want to transform the points again. Here's how to do that:
wcs = exposure.getWcs()
coord = wcs.pixelToSky(record.getCentroid())
print coord
This sort of calibration can only be done on one record at a time, and you can see that the pixelToSky(...) method returns an lsst.afw.coord.Coord object, not a Point2D. A Coord knows what coordinate system it is in (ICRS vs. galactic, etc.), and it lets you convert between different angle units, or different coordinate systems:
import lsst.afw.geom
print coord.getPosition(lsst.afw.geom.degrees)
print coord.toEcliptic().getPosition(lsst.afw.geom.degrees)
Converting shapes (ellipses) is a little trickier. We need to start by approximating the Wcs at the position of the object as an affine transformation:
affine = wcs.linearizePixelToSky(record.getCoord(), lsst.afw.geom.arcseconds)
moments = record.getShape().transform(affine.getLinear())
print moments
Even though the parameters are still called (ixx, iyy, ixy), these are actually in arcseconds^2, and the parameters are defined with (x=ra, y=dec). We could have used degrees, radians, or arcminutes instead. We can also convert this ellipse into the (a, b, theta) parametrization, or an (e1, e2, r) parametrization as well. We'll do this after converting to celestial coordinates, but we could have done it before:
import lsst.afw.geom.ellipses
axes = lsst.afw.geom.ellipses.Axes(moments)
print axes.getA(), axes.getB(), axes.getTheta()
separable = lsst.afw.geom.ellipses.SeparableDistortionDeterminantRadius(moments)
print separable.getE1(), separable.getE2(), separable.getRadius()
There are many other Separable classes in lsst.afw.geom.ellipses, as there are many definitions of ellipticity and radius.
If you have run the mosaic pipeline to calibrate many visits together, improved calibrations are saved on disk, but these are not immediately applied to the "calexp" and "src" datasets. This means that when you call exposure.getWcs(), you get the original single-frame-astrometry WCS, not the improved mosaic WCS. Reading and applying the improved astrometry and photometry is difficult, so we have written some utilities to do it for you.
The easiest way to use these utilities is to run the calibrateCatalog.py and calibrateExposure.py command-line scripts. These create new butler datasets, "calibrated_src" and "calibrated_exp". If you simply load those instead of "src" and "calexp", you will get the improved calibrations automatically.