biocantor.util.bins

This UCSC binning implementation is borrowed from gffutils. The code is duplicated here so that users who do not need GFF3 import/export are not required to install the [io] extras, which include gffutils, just for this module.

The documentation below is copied from that implementation:

https://github.com/daler/gffutils/blob/master/gffutils/bins.py

Implementation of the UCSC genome binning strategy – heavily commented and with tests to help understand what’s going on.

Ryan Dale 2013

With help from implementations in kent src and Brent Pedersen’s cruzdb, specifically:

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/lib/binRange.c

and

https://github.com/brentp/cruzdb/blob/master/cruzdb/__init__.py

For reference, Fig 7 in http://genome.cshlp.org/content/12/6/996.abstract looks like this:

----------------------------------------------------------------------------
|                                  1                                       |
----------------------------------------------------------------------------
|       2      |         3         |         4         |         5         |
----------------------------------------------------------------------------
|6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
----------------------------------------------------------------------------

     AAAAAAAAAAAAAAAAAAAAAAA          BBBBBBBBBBBBBBB             CC

The smallest bin feature “A” fits within is 1, but it also overlaps 2-3 and 7-12. The smallest bin feature “B” fits within is 4, but it also overlaps 1 and 14-17. The smallest bin feature “C” fits within is 20, but it also overlaps 1 and 5.
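The relationship between "smallest containing bin" and "all overlapping bins" in the figure can be sketched with a toy model. This is only an illustration of the three-level hierarchy drawn above, not the module's code: the names `TOY_LEVELS` and `toy_bins` are hypothetical, and coordinates are assumed to be half-open intervals measured in units of one leaf bin (0 to 16).

```python
# Toy model of the figure: (first bin number, bin width in leaf units),
# listed from the coarsest level to the finest.
TOY_LEVELS = [
    (1, 16),  # bin 1 spans the whole range
    (2, 4),   # bins 2-5 each span four leaf bins
    (6, 1),   # bins 6-21 are the leaf bins
]


def toy_bins(start, end):
    """Return (smallest containing bin, set of all overlapping bins)
    for the half-open interval [start, end) in leaf-bin units."""
    if not 0 <= start < end <= 16:
        raise ValueError("interval must satisfy 0 <= start < end <= 16")
    smallest = None
    overlapping = set()
    for first_bin, width in TOY_LEVELS:
        lo = start // width        # index of the bin holding the first unit
        hi = (end - 1) // width    # index of the bin holding the last unit
        overlapping.update(range(first_bin + lo, first_bin + hi + 1))
        if lo == hi:
            # The interval fits in a single bin at this level; finer levels
            # overwrite this value if the interval still fits there.
            smallest = first_bin + lo
    return smallest, overlapping


# Feature "B" modelled as leaf units [8, 12); feature "C" as [14, 15).
smallest_b, overlap_b = toy_bins(8, 12)
smallest_c, overlap_c = toy_bins(14, 15)
print(smallest_b, sorted(overlap_b))  # 4 [1, 4, 14, 15, 16, 17]
print(smallest_c, sorted(overlap_c))  # 20 [1, 5, 20]
```

Note that the overlapping set always includes the smallest containing bin itself, since a feature overlaps the bin it fits within.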

Module Contents

Functions

bins(start, stop[, fmt, one])

Uses the definition of a "genomic bin" described in Fig 7 of http://genome.cshlp.org/content/12/6/996.abstract.

Attributes

NEXT_SHIFT

FIRST_SHIFT

OFFSETS

COORD_OFFSETS

MAX_CHROM_SIZE

biocantor.util.bins.NEXT_SHIFT = 3
biocantor.util.bins.FIRST_SHIFT = 17
biocantor.util.bins.OFFSETS
biocantor.util.bins.COORD_OFFSETS
biocantor.util.bins.MAX_CHROM_SIZE
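
The two shift constants listed above determine the bin widths: shifting a coordinate right by FIRST_SHIFT bits maps it to a bin at the finest level, and each further shift by NEXT_SHIFT bits moves one level up. The arithmetic can be sketched as follows. The number of levels (`N_LEVELS`) is an assumption taken from the standard five-level UCSC scheme; in the module it is governed by OFFSETS, whose value is not shown on this page.

```python
NEXT_SHIFT = 3    # value listed above
FIRST_SHIFT = 17  # value listed above
N_LEVELS = 5      # assumption: five levels, as in the standard UCSC scheme

# Width in bases of one bin at each level, finest level first.
bin_widths = [1 << (FIRST_SHIFT + NEXT_SHIFT * level) for level in range(N_LEVELS)]

for level, width in enumerate(bin_widths):
    print(level, width)
# 0 131072      (128 kb)
# 1 1048576     (1 Mb)
# 2 8388608     (8 Mb)
# 3 67108864    (64 Mb)
# 4 536870912   (512 Mb)
```

Each level is 2**3 = 8 times wider than the one below it, so eight child bins tile every parent bin.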
biocantor.util.bins.bins(start, stop, fmt='gff', one=True)

Uses the definition of a “genomic bin” described in Fig 7 of http://genome.cshlp.org/content/12/6/996.abstract.

Parameters

fmt ('gff' | 'bed') – Specifies 1-based start coords (gff) or 0-based start coords (bed).

one – If one=True (default), return only the smallest bin that completely contains these coordinates (useful for assigning a single bin). If one=False, return the set of all bins that overlap these coordinates (useful for looking for features that could intersect).
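
As a rough illustration of the two modes, the following is a minimal sketch of the standard UCSC binning algorithm as it appears in kent's binRange.c. It is not the code in biocantor.util.bins: the names `sketch_bins`, `SKETCH_OFFSETS`, `SKETCH_COORD_OFFSETS` and `SKETCH_MAX` are hypothetical, the offsets are those of the five-level kent scheme, and the bin numbers and error handling may differ from what `bins()` actually returns.

```python
NEXT_SHIFT = 3
FIRST_SHIFT = 17
# Assumption: first bin number at each level, finest level first (kent scheme).
SKETCH_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
# Assumption: amount subtracted from start to get a 0-based coordinate.
SKETCH_COORD_OFFSETS = {"bed": 0, "gff": 1}
# Assumption: largest coordinate the five-level scheme can hold (512 Mb).
SKETCH_MAX = 1 << 29


def sketch_bins(start, stop, fmt="gff", one=True):
    """Return the smallest containing bin (one=True) or the set of all
    overlapping bins (one=False) for the given interval."""
    start0 = start - SKETCH_COORD_OFFSETS[fmt]  # 0-based, half-open start
    if start0 < 0 or stop <= start0 or stop > SKETCH_MAX:
        raise ValueError("interval is empty or outside the supported range")

    start_bin = start0 >> FIRST_SHIFT
    stop_bin = (stop - 1) >> FIRST_SHIFT  # bin holding the last base
    overlapping = set()
    for offset in SKETCH_OFFSETS:
        if one and start_bin == stop_bin:
            # Finest level at which both ends land in the same bin.
            return offset + start_bin
        overlapping.update(range(offset + start_bin, offset + stop_bin + 1))
        start_bin >>= NEXT_SHIFT
        stop_bin >>= NEXT_SHIFT
    # With one=False every level has been visited; with one=True the loop
    # always returns, because the range check above guarantees that both
    # ends share the single top-level bin.
    return overlapping


# A short feature inside the first 128 kb bin.
print(sketch_bins(1, 100, fmt="gff"))                      # 585
# A feature straddling the first 128 kb boundary needs a 1 Mb bin.
print(sketch_bins(131000, 131100, fmt="bed"))              # 73
print(sorted(sketch_bins(131000, 131100, fmt="bed", one=False)))
# [0, 1, 9, 73, 585, 586]
```

In a typical database workflow, the single bin from one=True would be stored alongside each feature, and the set from one=False would be used to restrict a range query to rows whose stored bin could contain an intersecting feature.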