
Summary of Informatics Team Projects


In the Informatics team, we provide essential support for informatics projects that involve enriching or creating digital records, supported by guidance, funding, and in-kind resources from the DPO. Our approach focuses on scalability, automation, interconnectivity, innovation, machine learning, open-source software, and reusable solutions. We prioritize workflows capable of handling large volumes of records, automating tedious tasks, and establishing seamless data transfer between systems. Research into cutting-edge tools and technologies, including AI, allows us to enhance images and records. We promote transparency through open-source software and develop adaptable solutions for broader applications across projects and institutions.


| Unit | Title | Description | Status | Repository | Dates | Records Created or Enhanced | More Info |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CHSDM | Update CIS from Data in Catalog Card | Using the data transcribed from the catalog cards, we are filling in the empty fields of the records in TMS. | ongoing | NA | May 2024 - | NA | NA |
| NMNH | Replace Image EXIF Metadata | The metadata in the HSFA Mass Digitization project contained non-ASCII characters in a subset of the images. We replaced the data with the correct values, regenerated the MD5 file, and delivered the files to DAMS. | ongoing | NA | Apr 2024 - | 11,307 | NA |
| NMAA | Image Deduplication between DAMS and Network Share | We will match images in the DAMS and a Network Share to keep only the highest-resolution images. In addition, we will help the unit ingest the rest of the images into DAMS and into ASpace. | ongoing | NA | Feb 2024 - | NA | NA |
| SI | SI Thesaurus Reconciliation Service | An SI-wide reconciliation service for OpenRefine that lets users reconcile against terms in the SI Thesaurus as well as other data sources. This includes data sources that do not have reconciliation services, like SI Open Access, LoC, GBIF, and others. | ongoing | Repository | Dec 2023 - | NA | NA |
| OCIO | Osprey on Hydra | Running the Osprey Worker script on the Hydra High Performance Cluster. This allows us to scale processing as needed. | ongoing | Repository | Mar 2023 - | 48,980 | NA |
| NMAAHC | Mass Digitization Pilot Project of the Johnson Publishing Company Archive | After the 25,050 images passed QC, we delivered the images to DAMS via a hotfolder. Then we created IDs for all archival items and used those IDs to create 9,409 stub records in Arches and to save 50,100 links between IDs in Getty's ID Manager. | ongoing | Repository | Aug 2022 - | 9,409 | NA |
| OCIO | Osprey Dashboard | System that receives images from vendors, checks that they meet the project requirements, and displays the results on a dashboard of Collections Digitization projects in DPO. Coded in Python. | ongoing | Repository | Jul 2022 - | 182,008 | NA |
| SI | SI Thesaurus | An SI-wide system to host thesauri, controlled vocabularies, taxonomies, and other lists generated by the SI units. The record count is the number of terms in the database. | ongoing | NA | Sep 2021 - | 1,481 | SharePoint |
| CHSDM | Cooper Hewitt Card Catalog Transcription | Digitization and transcription of the catalog cards of the museum's collection. The digitization vendor is using Virtual Barcodes to link the item ID to the database ID. | ongoing | Repository | Apr 2021 - | 56,396 | NA |
| NMNH | Tracking Scientific Names in Digitization of Bees | We used Virtual Barcodes to encode the IRN of the scientific name (from EMu's taxonomy) in the image metadata. The IRN was extracted to CSV files to populate the database, which avoided hard-coding the species name or IRN to the image. | completed | Repository | Dec 2019 - Apr 2020 | 30,020 | NA |
| NMAH | Virtual Barcodes for Digitization of Numismatics Collections | Each object has a record, so we used Virtual Barcodes to name the files using the unique database key value (MKEY) for easy matching of the images and the record. | completed | Repository | Oct 2019 - Feb 2020 | 25,204 | NA |
| NMAfA | Archives Stephen Grant Postcard Collection | Both sides of the postcards were stitched together in a single image. | completed | Repository | Oct 2019 - Mar 2020 | 7,410 | NA |
| OCIO | Mass Digi Dashboard (ver. up to 1.6) | Original dashboard used to track Mass Digitization projects in DPO. Dashboard was coded in R/Shiny. Replaced by Osprey. | completed | NA | Jan 2019 - Jul 2023 | 240,000* | NA |
| NMAH | Princeton Posters Mass Digitization | The mass digitization project needed to assign the unique database key value (MKEY) to the captured images since the objects already had item-level records. The Virtual Barcodes system allowed the vendor to search for the item and assign the filename if the item was found in the database. | completed | Repository | Dec 2018 - May 2019 | 17,976 | SOVA, Collections Search |
| OCIO | Shiny Application Servers | We are managing the internal and external R/Shiny servers. These allow the publication of web applications written entirely in R using the Shiny package. | ongoing | Repository | Jun 2018 - | NA | Confluence |

* Value was estimated

Some projects may touch the same records, so the total above will be less than the sum of all projects.

Small projects (e.g. simple file edits, data transfer, small data fixes) are not included in the table above.
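
The note about overlapping projects can be illustrated with a short sketch. It assumes each project can list the identifiers of the records it touched. The project names and record IDs below are hypothetical and do not come from any of the systems in the table.

```python
def distinct_record_count(records_by_project):
    """Count each record once, even when several projects touched it."""
    seen = set()
    for record_ids in records_by_project.values():
        seen.update(record_ids)
    return len(seen)


# Hypothetical example data: "rec-3" was touched by both projects.
records_by_project = {
    "project_a": ["rec-1", "rec-2", "rec-3"],
    "project_b": ["rec-3", "rec-4"],
}

naive_sum = sum(len(ids) for ids in records_by_project.values())
distinct_total = distinct_record_count(records_by_project)

print(f"Sum of per-project counts: {naive_sum}")
print(f"Distinct records: {distinct_total}")
```

Because the shared record is counted only once, the distinct total is smaller than the sum of the per-project counts.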


Software We Have Published

| Software | Details | Repository | More Info |
| --- | --- | --- | --- |
| SIT Reconcile | Customized reconciliation service for OpenRefine to allow reconciliation against sources that do not support it. | GitHub | More Info |
| Osprey | A verification system for digital files and associated dashboard to display the results. | GitHub | More Info |
| MD5 Tool | A command line and graphical tool to generate text files with the MD5 hash of files in a folder. Used for verification in DAMS ingestion and other processes. | GitHub | More Info |
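
To give a sense of what generating an MD5 text file for a folder involves, here is a minimal sketch using only the Python standard library. It is not the published MD5 Tool: the function names, the manifest filename `checksums.md5`, and the `hash  filename` line layout are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_manifest(folder, manifest_name="checksums.md5"):
    """Write one 'hash  filename' line per file in the folder (not recursive).

    The manifest name and line layout are hypothetical, not the format
    required by DAMS or produced by the published tool.
    """
    folder = Path(folder)
    manifest = folder / manifest_name
    lines = []
    for path in sorted(folder.iterdir()):
        # Skip subfolders and the manifest itself.
        if path.is_file() and path.name != manifest_name:
            lines.append(f"{md5_of_file(path)}  {path.name}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Calling `write_md5_manifest("path/to/images")` would write the manifest next to the files. A receiving system could then recompute each hash and compare it to the manifest to confirm that the files arrived unchanged.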