Working with remote data using DuckDB
Exploring the VIRION databaseLast week at the GEOBON meeting, Carl Boettiger gave me a very quick demo of DuckDB, which is a really cool project I hadn’t paid enough attention to yet. In this post, I will show how to use it to work remotely on the VIRION database, which we curate with Verena.
To make it all work, we need the DuckDB
package, and the usual data frame paackages for Julia:
using DataFrames
using DuckDB
using PrettyTables
The nice thing with DuckDB is that it is possible to work in-memory, so we will establish a new connection of this type:
db = DBInterface.connect(DuckDB.DB, ":memory:")
We can load data that are stored remotely, which only requires to add the http file-system to our database connection:
DBInterface.execute(db, "INSTALL httpfs;")
DBInterface.execute(db, "LOAD httpfs;")
This is useful when we don’t feel like downloading the data. If the data are expected to change rapidly, this is a considerable advantage: we can ensure that we work on the most recent version.
The VIRION data are stored on GitHub, and we will focus on a simple example where we pick the taxonomy of hosts and viruses, and the egdelist between these two:
hosts = "https://raw.githubusercontent.com/viralemergence/virion/main/Virion/TaxonomyHost.csv"
viruses = "https://raw.githubusercontent.com/viralemergence/virion/main/Virion/TaxonomyVirus.csv"
edges = "https://raw.githubusercontent.com/viralemergence/virion/main/Virion/Edgelist.csv"
In order to add these tables to the connection, we CREATE
a new view using the URL as the table name:
DBInterface.execute(db, "CREATE OR REPLACE TEMPORARY VIEW hosts AS SELECT * FROM read_csv_auto('$(hosts)', nullstr='NA');")
DBInterface.execute(db, "CREATE OR REPLACE TEMPORARY VIEW viruses AS SELECT * FROM read_csv_auto('$(viruses)', nullstr='NA');")
DBInterface.execute(db, "CREATE OR REPLACE TEMPORARY VIEW edges AS SELECT * FROM read_csv_auto('$(edges)', nullstr='NA');")
And we’re good to go. It’s not a bad idea to make sure that the tables have been loaded correctly, so let’s take a look at the content of the host taxonomy:
DataFrame(DBInterface.execute(db, "DESCRIBE hosts;")) |> pretty
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
HostTaxID | BIGINT | YES | |||
Host | VARCHAR | YES | |||
HostGenus | VARCHAR | YES | |||
HostFamily | VARCHAR | YES | |||
HostOrder | VARCHAR | YES | |||
HostClass | VARCHAR | YES | |||
HostNCBIResolved | BOOLEAN | YES |
We can start running a request like “the number of species in the 10 genus of bats for which we have the most species mapped onto the NCBI taxonomy”:
req = """
SELECT DISTINCT HostGenus AS Genus, COUNT(HostGenus) as Richness
FROM hosts
WHERE
(HostOrder LIKE 'chiroptera') AND
(HostNCBIResolved == true)
GROUP BY HostGenus
ORDER BY count(HostGenus) DESC
LIMIT 10;
"""
unique_bats = DBInterface.execute(db, req)
DataFrame(unique_bats) |> pretty
Genus | Richness |
---|---|
myotis | 56 |
rhinolophus | 40 |
hipposideros | 22 |
miniopterus | 19 |
pteropus | 19 |
eptesicus | 12 |
pipistrellus | 10 |
artibeus | 10 |
molossus | 8 |
scotophilus | 7 |
Nice. Now, we can start looking at joining the different tables, to get a sense of e.g. what viruses of non-human primates are in VIRION. Again, this will be a simple matter of writing the SQL query:
req = """
CREATE VIEW hostedges
AS
SELECT HostGenus, Host, VirusClass, VirusOrder, VirusFamily, VirusGenus, Virus FROM hosts
LEFT JOIN edges
ON edges.HostTaxID = hosts.HostTaxID
RIGHT JOIN viruses
ON edges.VirusTaxID = viruses.VirusTaxID
WHERE
(hosts.HostOrder LIKE 'primates') AND
(hosts.Host NOT LIKE 'homo sapiens') AND
(hosts.HostNCBIResolved = true) AND
(viruses.VirusNCBIResolved = true);
"""
make_view = DBInterface.execute(db, req)
get_primates_viruses = DBInterface.execute(db, "SELECT * FROM hostedges LIMIT 10;")
DataFrame(get_primates_viruses) |> pretty
HostGenus | Host | VirusClass | VirusOrder | VirusFamily | VirusGenus | Virus |
---|---|---|---|---|---|---|
callithrix | callithrix jacchus | flasuviricetes | amarillovirales | flaviviridae | hepacivirus | marmoset pegivirus |
callithrix | callithrix jacchus | flasuviricetes | amarillovirales | flaviviridae | pegivirus | southwest bike trail virus |
leontocebus | leontocebus fuscicollis | pokkesviricetes | chitovirales | poxviridae | orthopoxvirus | cowpox virus |
leontocebus | leontocebus fuscicollis | ellioviricetes | bunyavirales | arenaviridae | mammarenavirus | lymphocytic choriomeningitis mammarenavirus |
saguinus | saguinus mystax | monjiviricetes | mononegavirales | paramyxoviridae | morbillivirus | measles morbillivirus |
saguinus | saguinus mystax | flasuviricetes | amarillovirales | flaviviridae | pegivirus | pegivirus a |
leontocebus | leontocebus nigricollis | ellioviricetes | bunyavirales | arenaviridae | mammarenavirus | lymphocytic choriomeningitis mammarenavirus |
leontocebus | leontocebus nigricollis | flasuviricetes | amarillovirales | flaviviridae | pegivirus | pegivirus a |
saguinus | saguinus oedipus | pokkesviricetes | chitovirales | poxviridae | orthopoxvirus | cowpox virus |
saguinus | saguinus oedipus | monjiviricetes | mononegavirales | paramyxoviridae | morbillivirus | measles morbillivirus |
With this new view, we can make a table listing the number of know viruses between the species of hosts and the families of viruses:
req = """
SELECT
Host, VirusFamily, COUNT(Virus) AS Number
FROM
hostedges
GROUP BY
Host, VirusFamily
ORDER BY
Number DESC
LIMIT 20;
"""
get_num_viruses = DBInterface.execute(db, req)
DataFrame(get_num_viruses) |> pretty
Host | VirusFamily | Number |
---|---|---|
macaca mulatta | picobirnaviridae | 320 |
macaca mulatta | papillomaviridae | 55 |
macaca mulatta | adenoviridae | 32 |
pan troglodytes | picornaviridae | 29 |
pan troglodytes | adenoviridae | 27 |
macaca mulatta | parvoviridae | 25 |
pan troglodytes | herpesviridae | 25 |
macaca mulatta | picornaviridae | 23 |
pan troglodytes | polyomaviridae | 18 |
macaca fascicularis | herpesviridae | 17 |
macaca mulatta | astroviridae | 17 |
gorilla gorilla | adenoviridae | 14 |
macaca mulatta | herpesviridae | 14 |
macaca mulatta | retroviridae | 13 |
chlorocebus aethiops | herpesviridae | 13 |
saimiri sciureus | herpesviridae | 13 |
chlorocebus aethiops | paramyxoviridae | 12 |
gorilla beringei | adenoviridae | 12 |
gorilla beringei | herpesviridae | 12 |
gorilla gorilla | herpesviridae | 12 |
That’s it! DuckDB is definitely really nice, and I am going to integrate it in my workflow.