Examples

The following snippets show how to perform several kinds of extraction with querier.

Sample entries

This example shows how to take a sample entry from the Twitter database and print it in a human-readable way:

import querier as qr
from pprint import pprint

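database_name = "twitter_2020"
# Path to your credentials file (placeholder value, adjust to your setup)
credentials_file = "path/to/credentials_file"

with qr.Connection(database_name, credentials_file) as con: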
    result = con.extract_one(qr.Filter())
    pprint(result)

Due to the nature of MongoDB databases, which do not enforce a schema by default, entries can have missing fields (use querier.Filter.exists() to keep only the entries in which a given field is present).

It can be useful to sample a database this way first, to learn which fields its entries have.
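
Once an entry has been sampled, its nested fields can be listed in the dotted notation that filters use (as in 'place.country_code'). The following is a minimal sketch in plain Python; the helper field_paths and the sample dictionary are hypothetical and only stand in for the result of extract_one:

```python
def field_paths(entry, prefix=""):
    """Return the dotted paths of all leaf fields of a nested dictionary."""
    paths = []
    for key, value in entry.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Recurse into sub-documents to reach their own fields
            paths.extend(field_paths(value, path))
        else:
            paths.append(path)
    return sorted(paths)


# Hypothetical entry standing in for the result of `con.extract_one(...)`
sample = {
    "id": 1,
    "user": {"id": 42, "name": "someone"},
    "place": {"country_code": "ES", "name": "Palma"},
}

print(field_paths(sample))
```

Since a single entry may lack fields that other entries have, running this on a handful of samples gives a more complete picture.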

Extraction from a database

This example shows a way to extract tweets and store them in a .gz file. The extracted tweets were posted from Spain between March and September 2020 (note that we only store the fields ‘place’, ‘user’ and ‘id’):

import querier as qr
import json
import gzip
from datetime import datetime

database_name = "twitter_2020"

# Filter
date_end = datetime(2020, 9, 30, 23, 59, 59) # End of September 2020
date_start = datetime(2020, 3, 1, 0, 0, 0) # Start of March 2020

f = qr.Filter()
f.exists('place')
f.equals('place.country_code', 'ES')
f.less_or_equals('created_at', date_end)
f.greater_or_equals('created_at', date_start)

# The results will contain only these fields
fields = ['place', 'user', 'id']

# Extraction
with gzip.open('twitter_2020_ES.gz', 'wb') as outfile:
    with qr.Connection(database_name) as con:

        tweets_es = con.extract(f, fields).limit(100)

        for tweet in tweets_es:
            tweet_str = json.dumps(tweet, default=str)
            outfile.write((tweet_str + '\n').encode('utf-8'))

As databases can be (and often are) massive, it is advisable to limit the selected fields and to store entries in files only when it is strictly required. Limiting the extraction is also good practice when testing code that extracts data, as databases can contain millions of entries.
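
A file written this way holds one JSON document per line, so it can be read back lazily, one entry at a time. Here is a minimal sketch using only the standard library; the file name and the two entries are made up to keep the example self-contained. Note that json.dumps(..., default=str) has turned values that JSON cannot represent, such as dates, into strings:

```python
import gzip
import json
import os
import tempfile
from datetime import datetime

# Made-up entries standing in for extracted tweets
tweets = [
    {"id": 1, "created_at": datetime(2020, 3, 1, 12, 0, 0)},
    {"id": 2, "created_at": datetime(2020, 9, 30, 8, 30, 0)},
]

path = os.path.join(tempfile.mkdtemp(), "tweets_sample.gz")

# Write one JSON document per line, as in the extraction example
with gzip.open(path, "wb") as outfile:
    for tweet in tweets:
        outfile.write((json.dumps(tweet, default=str) + "\n").encode("utf-8"))


def read_entries(gz_path):
    """Yield the entries of a gzipped JSON-lines file one at a time."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as infile:
        for line in infile:
            yield json.loads(line)


loaded = list(read_entries(path))
print(loaded[0])
```

Because read_entries is a generator, the whole file never has to fit in memory unless the caller collects it into a list, as done here for display.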

Multiple connections

This example prints (in a human-readable way) the tweet that was marked as favorite the most across the 2018 to 2020 databases (the filter is empty, so all tweets are considered):

import querier as qr
from pprint import pprint

year_start = 2018
year_end = 2020
fields = ['created_at', 'text', 'user', 'retweet_count', 'favorite_count']
f = qr.Filter() # Empty filter
max_tweet = None

for year in range(year_start, year_end + 1):
    database_name = f"twitter_{year}"

    with qr.Connection(database_name) as con:
        result = con.extract(f, fields)
        for tweet in result:

            if max_tweet is None:
                max_tweet = tweet
                continue

            if tweet['favorite_count'] > max_tweet['favorite_count']:
                max_tweet = tweet

pprint(max_tweet)

It shows a way to extract entries from several databases using more than one connection. Twitter databases are split by year and named twitter_{year}, so extracting tweets from different years requires more than one Connection object.
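
The comparison loop can also be written with the built-in max. The sketch below runs on made-up lists standing in for the per-year results (it does not call querier); using .get with a default also guards against entries in which 'favorite_count' is missing:

```python
from itertools import chain

# Made-up results standing in for what each yearly connection would return
results_by_year = {
    2018: [{"text": "a", "favorite_count": 3}, {"text": "b"}],
    2019: [{"text": "c", "favorite_count": 10}],
    2020: [{"text": "d", "favorite_count": 7}],
}


def favorite_count(tweet):
    # Entries can lack the field, in which case they count as 0 favorites
    return tweet.get("favorite_count", 0)


all_tweets = chain.from_iterable(results_by_year.values())
# `default=None` covers the case where no tweet was extracted at all
max_tweet = max(all_tweets, key=favorite_count, default=None)
print(max_tweet)
```

With real connections, each result presumably has to be consumed while its connection is still open, for instance by taking the maximum of each year inside the with block and then the maximum of those.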

Group by a field and aggregate

Now imagine you would like to extract the places attached to the tweets of a collection that pass a given filter, and count how many tweets each place is attached to:

import querier as qr

tweets_filter = qr.Filter().equals("place.country_code", "FR")

with qr.Connection("twitter_2020") as con:
    places = (
        con["western_europe"]
         .groupby("place.id", pre_filter=tweets_filter, allowDiskUse=True)
         .agg(
             name=("place.name", "first"),
             type=("place.place_type", "first"),
             nr_tweets=("place.id", "count"),
             bbox=("place.bounding_box.coordinates", "first"),
         )
    )

    # Get a list of dictionaries with the keys given in `.agg`, plus "_id"
    # corresponding to the grouped-by-key, here "place.id":
    places_dicts = list(places)

    # or iterate through `places` to modify/further filter each entry before keeping
    # them in memory.

The .agg method is modeled on the named aggregations of pandas.core.groupby.DataFrameGroupBy.aggregate(), except that we provide a NamedTuple querier.NamedAgg() with keywords field and aggfunc. For reference, see pandas’ user guide. In this example we simply passed in tuples whose first entry is the field and whose second entry is the aggregation function, which also works.
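
To make the shape of such a result concrete, here is a plain-Python sketch of what grouping by a key with 'first' and 'count' aggregations computes. It runs on made-up entries and does not use querier; it assumes that 'first' keeps the value found in the first entry of each group and that 'count' counts the entries of the group:

```python
# Made-up tweets, reduced to the fields used in the aggregation
tweets = [
    {"place": {"id": "p1", "name": "Paris"}},
    {"place": {"id": "p2", "name": "Lyon"}},
    {"place": {"id": "p1", "name": "Paris"}},
]

groups = {}
for tweet in tweets:
    place = tweet["place"]
    # One output entry per distinct value of the grouped-by key; the name is
    # only set when the group is created, which mimics a 'first' aggregation
    group = groups.setdefault(
        place["id"], {"_id": place["id"], "name": place["name"], "nr_tweets": 0}
    )
    group["nr_tweets"] += 1  # mimics a 'count' aggregation

places_dicts = list(groups.values())
print(places_dicts)
```

Unlike this sketch, which holds every group in a Python dictionary, the aggregation in the example above is expressed through the connection rather than computed in a client-side loop.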

Geographic filters

If you want to select tweets with coordinates within a given place, let’s say New York City, first load the geometry of that place:

import querier as qr
import geopandas as gpd

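# Note the conversion to 'epsg:4326', the default (longitude, latitude) coordinate
# reference system.
# gpd.datasets was removed in geopandas 1.0; with recent versions, the same file
# can be located with the separate geodatasets package: geodatasets.get_path('nybb').
nyc_boroughs = gpd.read_file(gpd.datasets.get_path('nybb')).to_crs('epsg:4326')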
f = qr.Filter()

There are several ways to generate the corresponding filter, first using the full polygon:

f.geo_within('coordinates', nyc_boroughs.unary_union)

which may prove rather costly given the complexity of the input polygon. A rougher but simpler filter can be generated from NYC’s bounding box:

f.geo_within('coordinates', nyc_boroughs.total_bounds, geo_type='bbox')

or equivalently:

from shapely.geometry import box

f.geo_within('coordinates', box(*nyc_boroughs.total_bounds))
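
The trade-off between the full polygon and its bounding box can be illustrated without any geographic library. The sketch below is plain Python on a made-up triangle and is not how querier or MongoDB evaluate geographic filters; the helpers bounding_box, in_bbox and in_polygon are hypothetical. It shows that a bounding box test is cheaper, but also accepts points that lie outside the actual shape:

```python
def bounding_box(vertices):
    """Return (minx, miny, maxx, maxy), the same order as `total_bounds`."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def in_bbox(point, bounds):
    """Four comparisons, whatever the complexity of the original shape."""
    x, y = point
    minx, miny, maxx, maxy = bounds
    return minx <= x <= maxx and miny <= y <= maxy


def in_polygon(point, vertices):
    """Ray-casting test: the cost grows with the number of vertices."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal line passing through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


triangle = [(0, 0), (4, 0), (0, 4)]
bounds = bounding_box(triangle)

for point in [(1, 1), (3, 3)]:
    print(point, in_polygon(point, triangle), in_bbox(point, bounds))
```

The point (3, 3) passes the bounding box test although it lies outside the triangle, so results obtained with a bounding box filter may need a finer check afterwards if exact containment matters.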