Examples
========

The following snippets show how to perform common extractions with querier.


Sample entries
--------------

This example shows how to take a sample entry from the Twitter database and print it in
a human-readable way::

    import querier as qr
    from pprint import pprint

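    database_name = "twitter_2020"
    # Placeholder: path to the file holding your database credentials
    credentials_file = "path/to/credentials"

    with qr.Connection(database_name, credentials_file) as con: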
        result = con.extract_one(qr.Filter())
        pprint(result)

Because MongoDB does not enforce a fixed schema, entries can have missing fields (use
:py:meth:`querier.Filter.exists` to keep only the entries in which a given field is
present).

It can be useful to sample a database this way first to know the entries' fields.
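
As a sketch of how such a sample can be inspected, the helper below lists the dotted
paths of all fields in a nested entry, in the same notation used by filters (for
instance ``place.country_code``). The name ``field_paths`` and the sample entry are
made up for illustration; in practice the entry would be the result of ``extract_one``::

    def field_paths(entry, prefix=""):
        """Return the sorted dotted paths of all leaf fields in a nested dict."""
        paths = []
        for key, value in entry.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict) and value:
                # Recurse into non-empty sub-documents
                paths.extend(field_paths(value, path))
            else:
                paths.append(path)
        return sorted(paths)

    # Hypothetical entry standing in for the result of `con.extract_one(...)`
    sample = {
        "id": 1,
        "place": {"country_code": "ES", "name": "Madrid"},
        "user": {"id": 42},
    }
    print(field_paths(sample))

Since a single entry may lack fields that other entries have, running this on a few
samples gives a more complete picture than running it on one.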


Extraction from a database
--------------------------

This example shows a way to extract tweets and store them in a *.gz* file. The extracted
tweets were posted from Spain between March and September 2020 (note that we only store
the fields ``place``, ``user`` and ``id``)::

    import querier as qr
    import json
    import gzip
    from datetime import datetime

    database_name = "twitter_2020"

    # Filter
    date_end = datetime(2020, 9, 30, 23, 59, 59)  # End of September 2020
    date_start = datetime(2020, 3, 1, 0, 0, 0)  # Start of March 2020

    f = qr.Filter()
    f.exists('place')
    f.equals('place.country_code', 'ES')
    f.less_or_equals('created_at', date_end)
    f.greater_or_equals('created_at', date_start)

    # The results will have those fields only
    fields = ['place', 'user', 'id']

    # Extraction
    with gzip.open('twitter_2020_ES.gz', 'wb') as outfile:
        with qr.Connection(database_name) as con:

            tweets_es = con.extract(f, fields).limit(100)

            for tweet in tweets_es:
                tweet_str = json.dumps(tweet, default=str)
                outfile.write((tweet_str + '\n').encode('utf-8'))


As databases can be (and often are) massive, it is advisable to select only the fields
you need and to store entries in files only when strictly required. Limiting the
extraction, as done above with ``limit(100)``, is good practice when testing code that
extracts data from the database, since a single database can contain millions of entries.
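
The file written above contains one JSON document per line, so it can be read back
lazily with the standard library alone. The following is a minimal sketch;
``read_tweets`` is an illustrative helper, not part of querier, and the demo file is a
small temporary stand-in for the real output::

    import gzip
    import json
    import os
    import tempfile

    def read_tweets(path):
        """Yield one dictionary per non-empty line of a gzipped JSON-lines file."""
        with gzip.open(path, "rt", encoding="utf-8") as infile:
            for line in infile:
                if line.strip():
                    yield json.loads(line)

    # Self-contained demonstration on a small temporary file
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
    with gzip.open(demo_path, "wb") as outfile:
        for tweet in [{"id": 1}, {"id": 2}]:
            outfile.write((json.dumps(tweet, default=str) + "\n").encode("utf-8"))

    ids = [tweet["id"] for tweet in read_tweets(demo_path)]
    print(ids)

Note that values serialized through ``default=str`` (dates, for example) come back as
plain strings and have to be parsed again if their original type is needed.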


Multiple connections
--------------------

This example prints (in a human-readable way) the tweet with the highest favorite count
across the yearly databases from 2018 to 2020::

    import querier as qr
    from pprint import pprint

    year_start = 2018
    year_end = 2020
    fields = ['created_at', 'text', 'user', 'retweet_count', 'favorite_count']
    f = qr.Filter()
    f.exists('favorite_count')  # Skip entries lacking the field compared below
    max_tweet = None

    for year in range(year_start, year_end + 1):
        database_name = f"twitter_{year}"

        with qr.Connection(database_name) as con:
            result = con.extract(f, fields)
            for tweet in result:

                if max_tweet is None:
                    max_tweet = tweet
                    continue

                if tweet['favorite_count'] > max_tweet['favorite_count']:
                    max_tweet = tweet

    pprint(max_tweet)

This example shows a way to extract entries from several databases using more than one
connection. Twitter databases are split by year and named ``twitter_{year}``, so a
separate ``Connection`` object is required for each year from which tweets are extracted.
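
The search for the maximum can also be written with ``max`` over a chained iterator.
The sketch below uses lists of plain dictionaries in place of the per-year extraction
results, so that it runs without a database; ``fake_results`` and its contents are
invented for illustration::

    from itertools import chain

    # Stand-ins for the iterables returned by `con.extract(f, fields)` for each year
    fake_results = {
        2018: [{"text": "a", "favorite_count": 3}],
        2019: [{"text": "b", "favorite_count": 10}, {"text": "c"}],
        2020: [{"text": "d", "favorite_count": 7}],
    }

    all_tweets = chain.from_iterable(fake_results.values())
    # Entries may lack 'favorite_count'; a missing value is treated as zero here
    max_tweet = max(
        all_tweets, key=lambda tweet: tweet.get("favorite_count", 0), default=None
    )
    print(max_tweet)

With real connections, each result is presumably only usable while its ``with`` block
is open, so the per-year maximum would be computed inside the block and the overall
maximum taken afterwards.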


Group by a field and aggregate
------------------------------

Suppose you would like to extract the places attached to the tweets of a collection that
pass a given filter, and count how many tweets are attached to each place::

    import querier as qr

    tweets_filter = qr.Filter().equals("place.country_code", "FR")

    with qr.Connection("twitter_2020") as con:
        places = (
            con["western_europe"]
            .groupby("place.id", pre_filter=tweets_filter, allowDiskUse=True)
            .agg(
                name=("place.name", "first"),
                type=("place.place_type", "first"),
                nr_tweets=("place.id", "count"),
                bbox=("place.bounding_box.coordinates", "first"),
            )
        )

        # Get a list of dictionaries with the keys given in `.agg`, plus "_id"
        # corresponding to the grouped-by-key, here "place.id":
        places_dicts = list(places)

        # or iterate through `places` to modify/further filter each entry before keeping
        # them in memory.

The ``.agg`` method is modelled on the named aggregations of
:py:meth:`pandas.core.groupby.DataFrameGroupBy.aggregate`, except that we provide our
own ``NamedTuple``, :py:class:`querier.NamedAgg`, with the keywords ``field`` and
``aggfunc``. For reference, see `pandas' user guide`_. In this example we simply passed
in tuples whose first entry is the field and whose second entry is the aggregation
function, which also works.

.. _pandas' user guide:
    https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html?#named-aggregation
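
To make the meaning of ``"first"`` and ``"count"`` concrete, here is a plain-Python
sketch of what the aggregation above computes, run on a few made-up tweets. It only
mimics the result shape described in the comments of the example (one dictionary per
group, with ``_id`` holding the grouped-by key) and says nothing about how querier
performs the aggregation::

    # Made-up tweets, for illustration only
    tweets = [
        {"place": {"id": "p1", "name": "Paris"}},
        {"place": {"id": "p2", "name": "Lyon"}},
        {"place": {"id": "p1", "name": "Paris"}},
    ]

    groups = {}
    for tweet in tweets:
        place = tweet["place"]
        # "first": keep the value from the first tweet seen in each group
        group = groups.setdefault(
            place["id"], {"_id": place["id"], "name": place["name"], "nr_tweets": 0}
        )
        # "count": number of tweets in the group
        group["nr_tweets"] += 1

    places_dicts = list(groups.values())
    print(places_dicts)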


Geographic filters
------------------

To select tweets whose coordinates lie within a given place, say New York City, first
load the geometry of that place and create a filter::

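    import querier as qr
    import geopandas as gpd
    import geodatasets

    # `gpd.datasets` was removed in GeoPandas 1.0; the 'nybb' sample data is now
    # provided by the separate `geodatasets` package.
    # Note the conversion to 'epsg:4326', the default (longitude, latitude) coordinate
    # reference system.
    nyc_boroughs = gpd.read_file(geodatasets.get_path('nybb')).to_crs('epsg:4326')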
    f = qr.Filter()

There are several ways to generate the corresponding filter. The first uses the full
polygon::

    f.geo_within('coordinates', nyc_boroughs.unary_union)

which may prove rather costly given the complexity of the input polygon. A rougher but
simpler filter can be generated from NYC's bounding box::

    f.geo_within('coordinates', nyc_boroughs.total_bounds, geo_type='bbox')

or equivalently::

    from shapely.geometry import box

    f.geo_within('coordinates', box(*nyc_boroughs.total_bounds))
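
For intuition about these last two forms: a bounding box ``(minx, miny, maxx, maxy)``
describes the same rectangle as a closed polygon ring of five points. The sketch below
builds that rectangle as a GeoJSON ``Polygon`` using plain Python only;
``bbox_to_geojson`` is a hypothetical helper, the coordinates are rough illustrative
values, and nothing here is a claim about how querier translates its arguments
internally::

    def bbox_to_geojson(minx, miny, maxx, maxy):
        """Return the rectangle as a GeoJSON Polygon with a closed exterior ring."""
        ring = [
            [minx, miny],
            [maxx, miny],
            [maxx, maxy],
            [minx, maxy],
            [minx, miny],  # The ring ends where it starts
        ]
        return {"type": "Polygon", "coordinates": [ring]}

    # Rough (longitude, latitude) bounds, for illustration only
    rectangle = bbox_to_geojson(-74.26, 40.50, -73.70, 40.92)
    print(rectangle["coordinates"][0])

Because the rectangle contains the whole city and some of its surroundings, a bounding
box filter can match tweets that the full polygon would exclude.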