CollecTor
*********

Descriptor archives are available from CollecTor. If you need Tor’s
topology at a prior point in time, this is the place to go!

With CollecTor you can either read descriptors directly…

   import datetime
   import stem.descriptor.collector

   yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)

   # provide yesterday's exits

   exits = {}

   for desc in stem.descriptor.collector.get_server_descriptors(start = yesterday):
     if desc.exit_policy.is_exiting_allowed():
       exits[desc.fingerprint] = desc

   print('%i relays published an exiting policy in the past day...\n' % len(exits))

   for fingerprint, desc in exits.items():
     print('  %s (%s)' % (desc.nickname, fingerprint))

… or download the descriptors to disk and read them later.

   import datetime
   import stem.descriptor
   import stem.descriptor.collector

   yesterday = datetime.datetime.utcnow() - datetime.timedelta(days = 1)
   cache_dir = '~/descriptor_cache/server_desc_today'

   collector = stem.descriptor.collector.CollecTor()

   for f in collector.files('server-descriptor', start = yesterday):
     f.download(cache_dir)

   # then later...

   for f in collector.files('server-descriptor', start = yesterday):
     for desc in f.read(cache_dir):
       if desc.exit_policy.is_exiting_allowed():
         print('  %s (%s)' % (desc.nickname, desc.fingerprint))

   get_instance - Provides a singleton CollecTor used for...
     |- get_server_descriptors - published server descriptors
     |- get_extrainfo_descriptors - published extrainfo descriptors
     |- get_microdescriptors - published microdescriptors
     |- get_consensus - published router status entries
     |
     |- get_key_certificates - authority key certificates
     |- get_bandwidth_files - bandwidth authority heuristics
     +- get_exit_lists - TorDNSEL exit list

   File - Individual file residing within CollecTor
     |- read - provides descriptors from this file
     +- download - download this file to disk

   CollecTor - Downloader for descriptors from CollecTor
     |- get_server_descriptors - published server descriptors
     |- get_extrainfo_descriptors - published extrainfo descriptors
     |- get_microdescriptors - published microdescriptors
     |- get_consensus - published router status entries
     |
     |- get_key_certificates - authority key certificates
     |- get_bandwidth_files - bandwidth authority heuristics
     |- get_exit_lists - TorDNSEL exit list
     |
     |- index - metadata for content available from CollecTor
     +- files - files available from CollecTor

New in version 1.8.0.

stem.descriptor.collector.get_instance()

   Provides the singleton "CollecTor" used for this module’s shorthand
   functions.

   Returns:
      singleton "CollecTor" instance
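
The singleton behaviour can be sketched in plain Python. This is an
illustrative stand-in, not stem's implementation: the **CollecTor** class
below keeps only the constructor arguments this module documents.

```python
# Module-level singleton sketch, similar in spirit to
# stem.descriptor.collector.get_instance().

_instance = None

class CollecTor(object):
  def __init__(self, retries = 2, timeout = None):
    self.retries = retries
    self.timeout = timeout

def get_instance():
  # lazily construct a single shared CollecTor, reusing it thereafter
  global _instance

  if _instance is None:
    _instance = CollecTor()

  return _instance
```

Repeated calls hand back the same downloader, so the module's shorthand
functions all share one index cache.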

stem.descriptor.collector.get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)

   Shorthand for "get_server_descriptors()" on our singleton instance.

stem.descriptor.collector.get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)

   Shorthand for "get_extrainfo_descriptors()" on our singleton
   instance.

stem.descriptor.collector.get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)

   Shorthand for "get_microdescriptors()" on our singleton instance.

stem.descriptor.collector.get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)

   Shorthand for "get_consensus()" on our singleton instance.

stem.descriptor.collector.get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)

   Shorthand for "get_key_certificates()" on our singleton instance.

stem.descriptor.collector.get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)

   Shorthand for "get_bandwidth_files()" on our singleton instance.

stem.descriptor.collector.get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)

   Shorthand for "get_exit_lists()" on our singleton instance.

class stem.descriptor.collector.File(path, types, size, sha256, first_published, last_published, last_modified)

   Bases: "object"

   File within CollecTor.

   Variables:
      * **path** (*str*) – file path within collector

      * **types** (*tuple*) – descriptor types contained within this
        file

      * **compression** (*stem.descriptor.Compression*) – file
        compression, **None** if this cannot be determined

      * **size** (*int*) – size of the file

      * **sha256** (*str*) – file’s sha256 checksum

      * **start** (*datetime*) – first publication within the file,
        **None** if this cannot be determined

      * **end** (*datetime*) – last publication within the file,
        **None** if this cannot be determined

      * **last_modified** (*datetime*) – when the file was last
        modified

   read(directory=None, descriptor_type=None, start=None, end=None, document_handler='ENTRIES', timeout=None, retries=3)

      Provides descriptors from this archive. Descriptors are
      downloaded or read from disk as follows…

      * If this file has already been downloaded through "download()"
        these descriptors are read from disk.

      * If a **directory** argument is provided and the file is
        already present these descriptors are read from disk.

      * If a **directory** argument is provided and the file is not
        present, the file is downloaded to this location then read.

      * If the file has not been downloaded and no **directory**
        argument is provided then the file is downloaded to a
        temporary directory that’s deleted after it is read.

      Parameters:
         * **directory** (*str*) – destination to download into

         * **descriptor_type** (*str*) – descriptor type, this is
           guessed if not provided

         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

         * **document_handler**
           (*stem.descriptor.__init__.DocumentHandler*) – method in
           which to parse a "NetworkStatusDocument"

         * **timeout** (*int*) – timeout when connection becomes
           idle, no timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose

      Returns:
         iterator for "Descriptor" instances in the file

      Raises:
         * **ValueError** if unable to determine the descriptor type

         * **TypeError** if we cannot parse this descriptor type

         * "DownloadFailed" if the download fails
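
The download-or-read precedence above can be sketched with the standard
library. The helper name and its return convention are illustrative, not
part of stem's API: it yields the path to use and whether a download is
needed.

```python
import os
import tempfile

def resolve_path(downloaded_path, directory, filename):
  # mirror the decision order: prefer a prior download, then a copy in
  # the given directory, otherwise fall back to a temporary location
  if downloaded_path and os.path.exists(downloaded_path):
    return downloaded_path, False  # already downloaded, read from disk
  elif directory:
    path = os.path.join(directory, filename)
    return path, not os.path.exists(path)  # download only if absent
  else:
    # no prior download and no directory: temporary download location
    return os.path.join(tempfile.gettempdir(), filename), True
```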

   download(directory, decompress=True, timeout=None, retries=3, overwrite=False)

      Downloads this file to the given location. If the file already
      exists this is a no-op.

      Parameters:
         * **directory** (*str*) – destination to download into

         * **decompress** (*bool*) – decompress written file

         * **timeout** (*int*) – timeout when connection becomes
           idle, no timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose

         * **overwrite** (*bool*) – if this file exists but mismatches
           CollecTor’s checksum then it is overwritten if **True**,
           otherwise an exception is raised

      Returns:
         **str** with the path we downloaded to

      Raises:
         * "DownloadFailed" if the download fails

         * **IOError** if a mismatching file exists and
           **overwrite** is **False**
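
The checksum comparison behind the overwrite behaviour can be
illustrated with **hashlib**. The content and digest below are made up
for illustration; in practice CollecTor's index supplies the expected
checksums.

```python
import hashlib

def matches_checksum(content, expected_sha256):
  # hash the raw bytes and compare against the expected hex digest
  return hashlib.sha256(content).hexdigest() == expected_sha256

content = b'example archive bytes'
digest = hashlib.sha256(content).hexdigest()

print(matches_checksum(content, digest))      # True
print(matches_checksum(b'tampered', digest))  # False
```

A mismatching file on disk corresponds to the **IOError** case above
when **overwrite** is **False**.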

class stem.descriptor.collector.CollecTor(retries=2, timeout=None)

   Bases: "object"

   Downloader for descriptors from CollecTor. The contents of
   CollecTor are provided in an index that’s fetched as required.

   Variables:
      * **retries** (*int*) – number of times to attempt the request
        if downloading it fails

      * **timeout** (*float*) – duration before we’ll time out our
        request

   get_server_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)

      Provides server descriptors published during the given time
      range, sorted oldest to newest.

      Parameters:
         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

         * **cache_to** (*str*) – directory to cache archives into,
           if an archive is available here it is not downloaded

         * **bridge** (*bool*) – standard descriptors if **False**,
           bridge if **True**

         * **timeout** (*int*) – timeout for downloading each
           individual archive when the connection becomes idle, no
           timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose on a
           per-archive basis

      Returns:
         **iterator** of "ServerDescriptor" for the given time range

      Raises:
         "DownloadFailed" if the download fails

   get_extrainfo_descriptors(start=None, end=None, cache_to=None, bridge=False, timeout=None, retries=3)

      Provides extrainfo descriptors published during the given time
      range, sorted oldest to newest.

      Parameters:
         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

         * **cache_to** (*str*) – directory to cache archives into,
           if an archive is available here it is not downloaded

         * **bridge** (*bool*) – standard descriptors if **False**,
           bridge if **True**

         * **timeout** (*int*) – timeout for downloading each
           individual archive when the connection becomes idle, no
           timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose on a
           per-archive basis

      Returns:
         **iterator** of "RelayExtraInfoDescriptor" for the given time
         range

      Raises:
         "DownloadFailed" if the download fails

   get_microdescriptors(start=None, end=None, cache_to=None, timeout=None, retries=3)

      Provides microdescriptors estimated to be published during the
      given time range, sorted oldest to newest. Unlike
      server/extrainfo descriptors, microdescriptors change very
      infrequently…

         "Microdescriptors are expected to be relatively static and only change
         about once per week." -dir-spec section 3.3

      CollecTor archives only contain microdescriptors that *change*,
      so hourly tarballs often contain very few. Microdescriptors also
      do not contain their publication timestamp, so this is
      estimated.

      Parameters:
         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

         * **cache_to** (*str*) – directory to cache archives into,
           if an archive is available here it is not downloaded

         * **timeout** (*int*) – timeout for downloading each
           individual archive when the connection becomes idle, no
           timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose on a
           per-archive basis

      Returns:
         **iterator** of "Microdescriptor" for the given time range

      Raises:
         "DownloadFailed" if the download fails

   get_consensus(start=None, end=None, cache_to=None, document_handler='ENTRIES', version=3, microdescriptor=False, bridge=False, timeout=None, retries=3)

      Provides consensus router status entries published during the
      given time range, sorted oldest to newest.

      Parameters:
         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

         * **cache_to** (*str*) – directory to cache archives into,
           if an archive is available here it is not downloaded

         * **document_handler**
           (*stem.descriptor.__init__.DocumentHandler*) – method in
           which to parse a "NetworkStatusDocument"

         * **version** (*int*) – consensus variant to retrieve
           (versions 2 or 3)

         * **microdescriptor** (*bool*) – provides the
           microdescriptor consensus if **True**, standard consensus
           otherwise

         * **bridge** (*bool*) – standard descriptors if **False**,
           bridge if **True**

         * **timeout** (*int*) – timeout for downloading each
           individual archive when the connection becomes idle, no
           timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose on a
           per-archive basis

      Returns:
         **iterator** of "RouterStatusEntry" for the given time range

      Raises:
         "DownloadFailed" if the download fails

   get_key_certificates(start=None, end=None, cache_to=None, timeout=None, retries=3)

      Directory authority key certificates for the given time range,
      sorted oldest to newest.

      Parameters:
         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

         * **cache_to** (*str*) – directory to cache archives into,
           if an archive is available here it is not downloaded

         * **timeout** (*int*) – timeout for downloading each
           individual archive when the connection becomes idle, no
           timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose on a
           per-archive basis

      Returns:
         **iterator** of "KeyCertificate" for the given time range

      Raises:
         "DownloadFailed" if the download fails

   get_bandwidth_files(start=None, end=None, cache_to=None, timeout=None, retries=3)

      Bandwidth authority heuristics for the given time range, sorted
      oldest to newest.

      Parameters:
         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

         * **cache_to** (*str*) – directory to cache archives into,
           if an archive is available here it is not downloaded

         * **timeout** (*int*) – timeout for downloading each
           individual archive when the connection becomes idle, no
           timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose on a
           per-archive basis

      Returns:
         **iterator** of "BandwidthFile" for the given time range

      Raises:
         "DownloadFailed" if the download fails

   get_exit_lists(start=None, end=None, cache_to=None, timeout=None, retries=3)

      TorDNSEL exit lists for the given time range, sorted oldest to
      newest.

      Parameters:
         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

         * **cache_to** (*str*) – directory to cache archives into,
           if an archive is available here it is not downloaded

         * **timeout** (*int*) – timeout for downloading each
           individual archive when the connection becomes idle, no
           timeout applied if **None**

         * **retries** (*int*) – maximum attempts to impose on a
           per-archive basis

      Returns:
         **iterator** of "TorDNSEL" for the given time range

      Raises:
         "DownloadFailed" if the download fails

   index(compression='best')

      Provides the archives available in CollecTor.

      Parameters:
         **compression** (*descriptor.Compression*) – compression type
         to download the index with, if unspecified the best available
         compression is used

      Returns:
         **dict** with the archive contents

      Raises:
         If unable to retrieve the index this raises…

            * **ValueError** if json is malformed

            * **IOError** if unable to decompress

            * "DownloadFailed" if the download fails
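
The decompress-then-parse step can be mimicked with the standard
library. The payload below is a stand-in for CollecTor's real
index.json, which is fetched over the network.

```python
import json
import lzma

# stand-in for a downloaded, xz-compressed index.json payload
raw_index = json.dumps({
  'index_created': '2019-07-06 01:54',
  'files': [],
}).encode('utf-8')

compressed = lzma.compress(raw_index)

# decompress, then parse; malformed json here raises ValueError,
# matching the error cases listed above
index = json.loads(lzma.decompress(compressed).decode('utf-8'))

print(index['index_created'])
```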

   files(descriptor_type=None, start=None, end=None)

      Provides the files CollecTor presently has, sorted oldest to
      newest.

      Parameters:
         * **descriptor_type** (*str*) – descriptor type or prefix
           to retrieve

         * **start** (*datetime.datetime*) – publication time to
           begin with

         * **end** (*datetime.datetime*) – publication time to end
           with

      Returns:
         **list** of "File"

      Raises:
         If unable to retrieve the index this raises…

            * **ValueError** if json is malformed

            * **IOError** if unable to decompress

            * "DownloadFailed" if the download fails
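
The start/end filtering can be sketched against plain tuples.
**FakeFile** and **filter_files** below are illustrative stand-ins
keeping only the publication-range attributes; the real "File"
instances come from CollecTor's index.

```python
import collections
import datetime

FakeFile = collections.namedtuple('FakeFile', ['path', 'start', 'end'])

def filter_files(files, start = None, end = None):
  # keep files whose publication range overlaps [start, end]
  matches = []

  for f in files:
    if start and f.end and f.end < start:
      continue  # file's newest publication predates our range
    elif end and f.start and f.start > end:
      continue  # file's oldest publication postdates our range

    matches.append(f)

  return sorted(matches, key = lambda f: f.start)  # oldest to newest
```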
