{
"cells": [
{
"cell_type": "markdown",
"id": "bed3cbcb",
"metadata": {},
"source": [
"# Working with MOS target files\n",
"\n",
"## Learning goals\n",
"\n",
"The goal of this tutorial is to introduce you to the MOS target files, which contain information about the targets observed with the SDSS-V Multi-Object Spectrograph (MOS) spectrographs (i.e., APOGEE and BOSS). In particular, this tutorial uses the FITS version of the MOS target files. By the end of this tutorial, you should be able to:\n",
"\n",
"- Understand what the MOS target files are and what information they contain.\n",
"- Access the MOS target files using the `sdss-access` package.\n",
"- Read the MOS target FITS files using `astropy`, including how to handle catalogues that have been partitioned into multiple files.\n",
"\n",
"## What are the MOS target files?\n",
"\n",
"SDSS-V target selection is a complex process that involves cross-matching multiple input catalogues and then applying various sets of selection criteria to identify targets that are suitable for observation with the SDSS-V MOS spectrographs. Each set of selection criteria is referred to as a \"carton\" and it defines a target class of objects that are of interest for a specific SDSS-V science program (e.g., RM, AQMES, Galactic Genesis, etc.) The input catalogues, results of the cross-match, and carton outputs, are stored in a PostgreSQL database at University of Utah. The contents of the database are also accessible to the public through the Catalog Archive Server (CAS) for each data release. More information about the SDSS-V target selection process can be found [here](https://www.sdss.org/dr20/targeting/).\n",
"\n",
"The MOS target files contain the same information as the target selection database and are named for each one of the tables in the database. For example, the `catalogdb.catalog` table (`mos_catalog` in CAS) contains the `catalogid` identifiers for each unique target for a specific cross-match. This table is released as the [mos_target_catalog](https://data.sdss.org/datamodel/files/MOS_TARGET/V_TARG/mos_target_catalog.html) file product. The list of all MOS target products can be found in a table at the bottom of [this page](https://sdss.org/dr20/targeting/cross-match/).\n",
"\n",
"For each table/MOS target product we provide two types of files: binary FITS tables and [Parquet files](https://parquet.apache.org). We recommend using [Parquet files](#using-parquet-files) as they are smaller, more efficient to read, and they can be processed in such a way that you can handle data that is larger than your computer's memory. To bypass that problem, the FITS files for large tables are partitioned into multiple files, which can be read one at a time. This tutorial uses the FITS files. We recommend reading the `mos_target_parquet.ipynb` notebook first, which includes examples of how to read the Parquet files.\n",
"\n",
"## Dependencies\n",
"\n",
"This tutorial makes use of the following Python packages:\n",
"\n",
"- `sdss-access` to access SDSS data.\n",
"- `astropy` to read FITS files.\n",
"\n",
"These packages can be installed using `pip`:\n",
"\n",
"```bash\n",
"pip install sdss_access astropy\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "42cc84b2",
"metadata": {},
"source": [
"## Which files should I use?\n",
"\n",
"These are our recommendations when querying the MOS targeting data:\n",
"\n",
"- If you can, use [CASJobs](https://casjobs.sdss.org/) to query the `mos_` tables. Those tables have indices that will make most queries faster, and you don't need to worry about downloading large files or making sure your computer has enough memory.\n",
"- If CASJobs is not an option, use the Parquet files and `polars` to read them. With Parquet files you don't need to worry about chunking, and `polars` will allow you to perform operations on the data without having to load it all in memory.\n",
"- Use FITS files only as a last resource if you are using a programming language that doesn't have good support for Parquet files (e.g., IDL) or if you want to use the files with a software that only accepts FITS files.\n"
]
},
{
"cell_type": "markdown",
"id": "3f6a9b54",
"metadata": {},
"source": [
"## Using FITS files\n",
"\n",
"In this section we will show how to work with the MOS target FITS files. As mentioned above, we do not recommend using these files unless you are using a programming language that doesn't have good support for Parquet files (in which case a Python tutorial may not be that useful!)\n",
"\n",
"The MOS target FITS files contain the same information as the Parquet files in the form of binary FITS tables. However, to ensure that each file can be read in memory, large tables are partitioned into multiple files. For example, the `mos_target_sdss_id_flat` table is partitioned in 140 files, `mos_sdss_id_flat-001.fits` to `mos_sdss_id_flat-140.fits`. Let's write a function to retrieve the file paths. FITS files are also less efficient storing data than Parquet, so the total size of the FITS files for this example is significantly larger than the Parquet files.\n",
"\n",
"<div class=\"alert alert-block alert-warning\">\n",
"<b>Warning:</b> This code expects the code to exist locally and will not download any data. We recommend running this notebook in SciServer. If you are sure that you want to download the data locally, set the `download` parameter to `True`.\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "db75e5b0",
"metadata": {},
"outputs": [],
"source": [
"from sdss_access import Access, Path\n",
"\n",
"import pathlib\n",
"\n",
"# For DR20, the targeting version is 2.0.0.\n",
"V_TARG = \"2.0.0\"\n",
"\n",
"# If necessary, set the SAS_BASE_PATH environment variable.\n",
"# os.environ[\"SAS_BASE_DIR\"] = \"/uufs/chpc.utah.edu/common/home/sdss50/\"\n",
"\n",
"\n",
"def get_mos_target_fits_path(mos_target_product, download=False):\n",
" \"\"\"Get the path to the FITS file for a given MOS target product, downloading it if it doesn't exist.\"\"\"\n",
"\n",
" path = Path(release=\"DR20\")\n",
"\n",
" # Get the local SAS path to the FITS file for the specified MOS target product.\n",
" mos_target_path = path.full(mos_target_product, v_targ=V_TARG, ftype=\"fits\", num=\"*\")\n",
"\n",
" # If the file contains a * it means it's partitioned. We do a bit of globbing to get all the files.\n",
" if \"*\" in mos_target_path:\n",
" mos_target_paths = pathlib.Path(mos_target_path).parent.glob(pathlib.Path(mos_target_path).name)\n",
" mos_target_paths = list(sorted(map(str, mos_target_paths)))\n",
" else:\n",
" mos_target_paths = [mos_target_path]\n",
"\n",
" # Check that all the files exist.\n",
" if not all(pathlib.Path(file).exists() for file in mos_target_paths):\n",
" if download:\n",
" access = Access(release=\"DR20\")\n",
" access.remote()\n",
" access.add(mos_target_product, v_targ=V_TARG, ftype=\"fits\", num=\"*\")\n",
" access.set_stream()\n",
" access.commit()\n",
" else:\n",
" raise FileNotFoundError(f\"One or more files for {mos_target_product!r} do not exist locally.\")\n",
"\n",
" return mos_target_paths"
]
},
{
"cell_type": "markdown",
"id": "e0863891",
"metadata": {},
"source": [
"This looks pretty similar to the Parquet function but note that we are adding `num=\"*\"` to retrieve all the chunk files for each table. Let's see the result for the `mos_target_sdss_id_flat` and `mos_carton` tables. The first one will be a list of multiple files, with the second containing a single file because the `carton` table is small and doesn't need to be partitioned.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "83891e91",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[0;31m[ERROR]: \u001b[0mTraceback (most recent call last):\n",
" File \u001b[36m\"/Users/gallegoj/Documents/Code/sdss5/dr20_tutorials/.venv/lib/python3.14/site-packages/IPython/core/interactiveshell.py\"\u001b[39;49;00m, line \u001b[34m3747\u001b[39;49;00m, in run_code\u001b[37m\u001b[39;49;00m\n",
"\u001b[37m \u001b[39;49;00mexec(code_obj, \u001b[36mself\u001b[39;49;00m.user_global_ns, \u001b[36mself\u001b[39;49;00m.user_ns)\u001b[37m\u001b[39;49;00m\n",
"\u001b[37m \u001b[39;49;00m~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\u001b[37m\u001b[39;49;00m\n",
" File \u001b[36m\"/var/folders/5g/fgdzk0h51yl6kyrfwyl3wjgc0000gn/T/ipykernel_550/2199777847.py\"\u001b[39;49;00m, line \u001b[34m1\u001b[39;49;00m, in <module>\u001b[37m\u001b[39;49;00m\n",
"\u001b[37m \u001b[39;49;00mmos_carton_files = get_mos_target_fits_path(\u001b[33m\"\u001b[39;49;00m\u001b[33mmos_target_carton\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, download=\u001b[34mFalse\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
" File \u001b[36m\"/var/folders/5g/fgdzk0h51yl6kyrfwyl3wjgc0000gn/T/ipykernel_550/3596011898.py\"\u001b[39;49;00m, line \u001b[34m36\u001b[39;49;00m, in get_mos_target_fits_path\u001b[37m\u001b[39;49;00m\n",
"\u001b[37m \u001b[39;49;00m\u001b[34mraise\u001b[39;49;00m \u001b[36mFileNotFoundError\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33mOne or more files for \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mmos_target_product\u001b[33m!r}\u001b[39;49;00m\u001b[33m do not exist locally.\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
"\u001b[91mFileNotFoundError\u001b[39;49;00m: One or more files for 'mos_target_carton' do not exist locally.\u001b[37m\u001b[39;49;00m\n",
"\n"
]
}
],
"source": [
"mos_carton_files = get_mos_target_fits_path(\"mos_target_carton\", download=False)\n",
"mos_carton_to_target_files = get_mos_target_fits_path(\"mos_target_carton_to_target\", download=False)\n",
"mos_target_files = get_mos_target_fits_path(\"mos_target_target\", download=False)\n",
"mos_sdss_id_flat_files = get_mos_target_fits_path(\"mos_target_sdss_id_flat\", download=False)\n",
"mos_catalog_to_gaia_dr3_files = get_mos_target_fits_path(\"mos_target_catalog_to_gaia_dr3_source\", download=False)\n",
"mos_gaia_dr3_files = get_mos_target_fits_path(\"mos_target_gaia_dr3_source\", download=False)\n",
"\n",
"print(mos_sdss_id_flat_files)\n",
"print(mos_carton_files)\n"
]
},
{
"cell_type": "markdown",
"id": "49385a79",
"metadata": {},
"source": [
"There are two main ways to handle the partitioned files: we can either read each of them sequentially and concatenate the results (e.g., with the `astropy.table` `vstack` function), or we can read them one at a time and perform the necessary operations on each file before moving to the next one, saving the results we are interested in. The first approach is a bit less convoluted but it may exceed the available memory if the resulting table is too large. Here we will take the second approach.\n",
"\n",
"We'll start by reading the `mos_carton` file and getting the `carton_pk` identifiers for the Galactic Genesis cartons.\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "95871aed",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" carton carton_pk mapper_pk ... program target_selection_plan\n",
"------------------- --------- --------- ... -------- ---------------------\n",
" mwm_snc_100pc 126 0 ... mwm_snc 0.1.0\n",
" mwm_snc_250pc 127 0 ... mwm_snc 0.1.0\n",
" mwm_cb_300pc 128 0 ... mwm_cb 0.1.0\n",
"mwm_cb_cvcandidates 134 0 ... mwm_cb 0.1.0\n",
" mwm_halo_sm 140 0 ... mwm_halo 0.1.0\n",
" mwm_halo_bb 143 0 ... mwm_halo 0.1.0\n",
" mwm_yso_s1 144 0 ... mwm_yso 0.1.0\n",
" mwm_yso_s2 145 0 ... mwm_yso 0.1.0\n",
" mwm_yso_s2-5 146 0 ... mwm_yso 0.1.0\n",
" mwm_yso_s3 147 0 ... mwm_yso 0.1.0\n"
]
}
],
"source": [
"import numpy\n",
"from astropy.table import Table\n",
"\n",
"# Read the carton table and print a sample.\n",
"carton_table = Table.read(mos_carton_files[0], format=\"fits\")\n",
"print(carton_table[:10])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "868405a5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Carton PKs: [544, 1628, 1810, 1811]\n"
]
}
],
"source": [
"# This file species is a single file, so we can do normal slicing on it.\n",
"# But we want to get any carton that starts with \"mwm_galactic_core\", which\n",
"# is not totally trivial. Since the table is small, we will simply loop over the rows.\n",
"carton_pks = []\n",
"for row in carton_table:\n",
" if row[\"carton\"].startswith(\"mwm_galactic_core\"):\n",
" carton_pks.append(int(row[\"carton_pk\"]))\n",
"\n",
"print(f\"Carton PKs: {carton_pks}\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "9e2722c7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"target_pk\n",
"---------\n",
"246626552\n",
"246628288\n",
"246629026\n",
"246650074\n",
"246650089\n",
"246650110\n",
"246650860\n",
"246655241\n",
"246655333\n",
"246655420\n",
" ...\n",
"365440161\n",
"365440162\n",
"365440163\n",
"365440164\n",
"365440165\n",
"365440166\n",
"365440167\n",
"365440168\n",
"365440169\n",
"365440170\n",
"Length = 11461373 rows\n"
]
}
],
"source": [
"from astropy.table import vstack, unique\n",
"\n",
"# We will now do the same with the files in the mos_target_carton_to_target table/\n",
"# Here we iterate over each chunk file, get the rows that match the carton PKs,\n",
"# and keep only the target_pk column.\n",
"target_pk_chunks = []\n",
"for carton_to_target_file in mos_carton_to_target_files:\n",
" carton_to_target_table = Table.read(carton_to_target_file, format=\"fits\")\n",
"\n",
" # Get the rows that match the carton PKs and keep only the target_pk column.\n",
" matching_rows = carton_to_target_table[numpy.isin(carton_to_target_table[\"carton_pk\"], carton_pks)]\n",
" target_pks = matching_rows[\"target_pk\"]\n",
"\n",
" # Add these partial results to the list of target PK chunks.\n",
" target_pk_chunks.append(target_pks)\n",
"\n",
"# Concatenate the results.\n",
"carton_to_target_gal_gen = vstack(target_pk_chunks)\n",
"\n",
"# Many target_pks will be repeated, so we get the unique values.\n",
"carton_to_target_gal_gen = unique(carton_to_target_gal_gen)\n",
"\n",
"# Print results\n",
"print(carton_to_target_gal_gen)"
]
},
{
"cell_type": "markdown",
"id": "2cbdc011",
"metadata": {},
"source": [
"We will do the same to get the `catalogids` and the `sdss_ids`.\n",
"\n",
"<div class=\"alert alert-block alert-warning\">\n",
"<b>Warning:</b> The following cell may take over an hour to run! You can uncomment the `if len(...)` lines to test the code with a smaller number of files. Note that this will not give you the full results, but it will allow you to check that the code is working as expected without having to wait for all files to be processed.\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41c6fee6",
"metadata": {},
"outputs": [],
"source": [
"from astropy.table import join\n",
"\n",
"# Get the catalogids from the mos_target table\n",
"catalogid_chunks = []\n",
"for target_file in mos_target_files:\n",
" target_table = Table.read(target_file, format=\"fits\")\n",
"\n",
" # Get the rows that match the target PKs and keep only the catalogid column.\n",
" # We will us the join function from astropy.table which is likely to be more\n",
" # efficient.\n",
" matching_rows = join(\n",
" target_table,\n",
" carton_to_target_gal_gen,\n",
" keys_left=[\"target_pk\"],\n",
" keys_right=[\"target_pk\"],\n",
" join_type=\"inner\",\n",
" )\n",
" catalogids = matching_rows[\"catalogid\"]\n",
"\n",
" # Add these partial results to the list of catalogid chunks.\n",
" if len(catalogids) > 0:\n",
" catalogid_chunks.append(catalogids)\n",
"\n",
" # Uncomment the following lines if you want to stop after the\n",
" # first file with results (for testing purposes).\n",
" # if len(catalogid_chunks) > 0:\n",
" # break\n",
"\n",
"# Concatenate the results and get only unique values.\n",
"catalogids = unique(vstack(catalogid_chunks))\n",
"\n",
"# Now do the same to get the associated sdss_ids from the mos_target_sdss_id_flat table.\n",
"sdss_id_chunks = []\n",
"for sdss_id_file in mos_sdss_id_flat_files:\n",
" sdss_id_table = Table.read(sdss_id_file, format=\"fits\")\n",
"\n",
" matching_rows = sdss_id_table[numpy.isin(sdss_id_table[\"catalogid\"], catalogids[\"catalogid\"])]\n",
" sdss_ids = matching_rows[[\"sdss_id\", \"catalogid\"]]\n",
"\n",
" if len(sdss_ids) > 0:\n",
" sdss_id_chunks.append(sdss_ids)\n",
"\n",
" # Uncomment the following lines if you want to stop after the\n",
" # first file with results (for testing purposes).\n",
" # if len(sdss_id_chunks) > 0:\n",
" # break\n",
"\n",
"sdss_ids = unique(vstack(sdss_id_chunks))\n",
"\n",
"print(sdss_ids)"
]
},
{
"cell_type": "markdown",
"id": "56e3fe4b",
"metadata": {},
"source": [
"We will not show how to get the Galactic coordinates for these targets using the Gaia DR3 tables, but the process would be similar to what we just did but using the `mos_target_gaia_dr3_source` and `mos_target_catalog_to_gaia_dr3_source`.\n",
"\n",
"One important thing to note is the very different performance of the operations with FITS files compared to Parquet files. The operation we just performed using FITS files (getting all the `sdss_ids` associated with the Galactic Genesis cartons) took over two hours while the same operations with Parquet files took approximately 30 seconds.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dr20_tutorials",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}