Working with remote data using DuckDB

Exploring the VIRION database

Last week at the GEOBON meeting, Carl Boettiger gave me a very quick demo of DuckDB, which is a really cool project I hadn’t paid enough attention to yet. In this post, I will show how to use it to work remotely on the VIRION database, which we curate with Verena.

To make it all work, we need the DuckDB package, and the usual data frame paackages for Julia:

using DataFrames
using DuckDB
using PrettyTables

The nice thing with DuckDB is that it is possible to work in-memory, so we will establish a new connection of this type:

db = DBInterface.connect(DuckDB.DB, ":memory:")

We can load data that are stored remotely, which only requires to add the http file-system to our database connection:

DBInterface.execute(db, "INSTALL httpfs;")
DBInterface.execute(db, "LOAD httpfs;")

This is useful when we don’t feel like downloading the data. If the data are expected to change rapidly, this is a considerable advantage: we can ensure that we work on the most recent version.

The VIRION data are stored on GitHub, and we will focus on a simple example where we pick the taxonomy of hosts and viruses, and the egdelist between these two:

hosts = "https://raw.githubusercontent.com/viralemergence/virion/main/Virion/TaxonomyHost.csv"
viruses = "https://raw.githubusercontent.com/viralemergence/virion/main/Virion/TaxonomyVirus.csv"
edges = "https://raw.githubusercontent.com/viralemergence/virion/main/Virion/Edgelist.csv"

In order to add these tables to the connection, we CREATE a new view using the URL as the table name:

DBInterface.execute(db, "CREATE OR REPLACE TEMPORARY VIEW hosts AS SELECT * FROM read_csv_auto('$(hosts)', nullstr='NA');")
DBInterface.execute(db, "CREATE OR REPLACE TEMPORARY VIEW viruses AS SELECT * FROM read_csv_auto('$(viruses)', nullstr='NA');")
DBInterface.execute(db, "CREATE OR REPLACE TEMPORARY VIEW edges AS SELECT * FROM read_csv_auto('$(edges)', nullstr='NA');")

And we’re good to go. It’s not a bad idea to make sure that the tables have been loaded correctly, so let’s take a look at the content of the host taxonomy:

DataFrame(DBInterface.execute(db, "DESCRIBE hosts;")) |>  pretty
column_name column_type null key default extra
HostTaxID BIGINT YES
Host VARCHAR YES
HostGenus VARCHAR YES
HostFamily VARCHAR YES
HostOrder VARCHAR YES
HostClass VARCHAR YES
HostNCBIResolved BOOLEAN YES

We can start running a request like “the number of species in the 10 genus of bats for which we have the most species mapped onto the NCBI taxonomy”:

req = """
SELECT DISTINCT HostGenus AS Genus, COUNT(HostGenus) as Richness
    FROM hosts
    WHERE
        (HostOrder LIKE 'chiroptera') AND
        (HostNCBIResolved == true)
    GROUP BY HostGenus
    ORDER BY count(HostGenus) DESC
    LIMIT 10;
"""
unique_bats = DBInterface.execute(db, req)
DataFrame(unique_bats) |> pretty
Genus Richness
myotis 56
rhinolophus 40
hipposideros 22
miniopterus 19
pteropus 19
eptesicus 12
pipistrellus 10
artibeus 10
molossus 8
scotophilus 7

Nice. Now, we can start looking at joining the different tables, to get a sense of e.g. what viruses of non-human primates are in VIRION. Again, this will be a simple matter of writing the SQL query:

req = """
CREATE VIEW hostedges
AS
SELECT HostGenus, Host, VirusClass, VirusOrder, VirusFamily, VirusGenus, Virus FROM hosts
LEFT JOIN edges
ON edges.HostTaxID = hosts.HostTaxID
RIGHT JOIN viruses
ON edges.VirusTaxID = viruses.VirusTaxID
WHERE
    (hosts.HostOrder LIKE 'primates') AND
    (hosts.Host NOT LIKE 'homo sapiens') AND
    (hosts.HostNCBIResolved = true) AND
    (viruses.VirusNCBIResolved = true);
"""
make_view = DBInterface.execute(db, req)
get_primates_viruses = DBInterface.execute(db, "SELECT * FROM hostedges LIMIT 10;")
DataFrame(get_primates_viruses) |> pretty
HostGenus Host VirusClass VirusOrder VirusFamily VirusGenus Virus
callithrix callithrix jacchus flasuviricetes amarillovirales flaviviridae hepacivirus marmoset pegivirus
callithrix callithrix jacchus flasuviricetes amarillovirales flaviviridae pegivirus southwest bike trail virus
leontocebus leontocebus fuscicollis pokkesviricetes chitovirales poxviridae orthopoxvirus cowpox virus
leontocebus leontocebus fuscicollis ellioviricetes bunyavirales arenaviridae mammarenavirus lymphocytic choriomeningitis mammarenavirus
saguinus saguinus mystax monjiviricetes mononegavirales paramyxoviridae morbillivirus measles morbillivirus
saguinus saguinus mystax flasuviricetes amarillovirales flaviviridae pegivirus pegivirus a
leontocebus leontocebus nigricollis ellioviricetes bunyavirales arenaviridae mammarenavirus lymphocytic choriomeningitis mammarenavirus
leontocebus leontocebus nigricollis flasuviricetes amarillovirales flaviviridae pegivirus pegivirus a
saguinus saguinus oedipus pokkesviricetes chitovirales poxviridae orthopoxvirus cowpox virus
saguinus saguinus oedipus monjiviricetes mononegavirales paramyxoviridae morbillivirus measles morbillivirus

With this new view, we can make a table listing the number of know viruses between the species of hosts and the families of viruses:

req = """
SELECT
    Host, VirusFamily, COUNT(Virus) AS Number
FROM
    hostedges
GROUP BY
    Host, VirusFamily
ORDER BY
    Number DESC
LIMIT 20;
"""
get_num_viruses = DBInterface.execute(db, req)
DataFrame(get_num_viruses) |> pretty
Host VirusFamily Number
macaca mulatta picobirnaviridae 320
macaca mulatta papillomaviridae 55
macaca mulatta adenoviridae 32
pan troglodytes picornaviridae 29
pan troglodytes adenoviridae 27
macaca mulatta parvoviridae 25
pan troglodytes herpesviridae 25
macaca mulatta picornaviridae 23
pan troglodytes polyomaviridae 18
macaca fascicularis herpesviridae 17
macaca mulatta astroviridae 17
gorilla gorilla adenoviridae 14
macaca mulatta herpesviridae 14
macaca mulatta retroviridae 13
chlorocebus aethiops herpesviridae 13
saimiri sciureus herpesviridae 13
chlorocebus aethiops paramyxoviridae 12
gorilla beringei adenoviridae 12
gorilla beringei herpesviridae 12
gorilla gorilla herpesviridae 12

That’s it! DuckDB is definitely really nice, and I am going to integrate it in my workflow.