PyPop.Filter#

Filters for pre-filtering of data files before analysis.

This module includes filters that modify or otherwise transform the input data before it is passed to PyPop analysis.

Exceptions#

SubclassError

Customized exception if a subclass doesn't implement required methods.

Classes#

Filter

Abstract base class for all Filters.

PassThroughFilter

A filter that doesn't change input data.

AnthonyNolanFilter

Filters data via Anthony Nolan's allele call data.

BinningFilter

Filters original data into "bins".

AlleleCountAnthonyNolanFilter

Filters data with an allele count less than a threshold.

Module Contents#

exception SubclassError#

Bases: Exception

Inheritance diagram of PyPop.Filter.SubclassError

Customized exception if a subclass doesn’t implement required methods.

class Filter#

Bases: abc.ABC

Inheritance diagram of PyPop.Filter.Filter

Abstract base class for all Filters.

abstractmethod doFiltering(matrix=None)#
abstractmethod startFirstPass(locus)#
abstractmethod checkAlleleName(alleleName)#
abstractmethod addAllele(alleleName)#
abstractmethod endFirstPass()#
abstractmethod startFiltering()#
abstractmethod filterAllele(alleleName)#
abstractmethod endFiltering()#
abstractmethod writeToLog(logstring=None)#
abstractmethod cleanup()#
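
Because Filter derives from abc.ABC and declares its methods abstract, Python refuses to instantiate any subclass that leaves one of them unimplemented. The following is a minimal sketch of that behaviour using a hypothetical stand-in base class (SketchFilter, with only two of the methods) rather than PyPop itself; all class and function names here are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class SketchFilter(ABC):
    """Hypothetical stand-in for PyPop.Filter.Filter (two methods only)."""

    @abstractmethod
    def startFirstPass(self, locus):
        ...

    @abstractmethod
    def filterAllele(self, alleleName):
        ...


class IncompleteFilter(SketchFilter):
    """Implements only one of the required methods."""

    def startFirstPass(self, locus):
        self.locus = locus


class IdentityFilter(SketchFilter):
    """Implements every abstract method, so it can be instantiated."""

    def startFirstPass(self, locus):
        self.locus = locus

    def filterAllele(self, alleleName):
        return alleleName


def can_instantiate(cls):
    """Return True if cls() succeeds, False if abc rejects it with TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

In this sketch can_instantiate(IncompleteFilter) is False, because filterAllele is still abstract, while IdentityFilter can be created and used normally.
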
class PassThroughFilter#

Bases: Filter

Inheritance diagram of PyPop.Filter.PassThroughFilter

A filter that doesn’t change input data.

doFiltering(matrix=None)#
startFirstPass(locus)#
checkAlleleName(alleleName)#
addAllele(alleleName)#
endFirstPass()#
startFiltering()#
filterAllele(alleleName)#
endFiltering()#
writeToLog(logstring=None)#
cleanup()#
class AnthonyNolanFilter(directoryName=None, remoteMSF=None, alleleFileFormat='msf', preserveAmbiguousFlag=0, preserveUnknownFlag=0, preserveLowresFlag=0, alleleDesignator='*', logFile=None, untypedAllele='****', unsequencedSite='#', sequenceFileSuffix='_prot', filename=None, numDigits=4, verboseFlag=1, debug=0, sequenceFilterMethod='strict')#

Bases: Filter

Inheritance diagram of PyPop.Filter.AnthonyNolanFilter

Filters data via Anthony Nolan’s allele call data.

Allele call data files can be of either txt or msf formats.

Base class parameters.

Parameters:
  • directoryName (str) – directory in which the Anthony Nolan allele data is located

  • remoteMSF (str) – Specifies the version (tag) of the remote msf directory in the IMGT-HLA GitHub repo. If present, the remote MSF files for the specified version will be downloaded on-demand, and cached for later reuse

  • alleleFileFormat (str, optional) – file format, can be txt or msf (default). Use of msf files is required in order to translate allele codes into polymorphic sequence data.

  • preserveAmbiguousFlag (int, optional) – If set to 0 (default), ambiguity is removed (e.g. 010101/0102/010301 is truncated to 0101). To preserve the ambiguity, set the option to 1 (for this example, this results in a filtered allele “name” of 0101/0102/0103).

  • preserveUnknownFlag (int, optional) – If set to 0 (default), unknown alleles are replaced with the untypedAllele designator. To keep unrecognized allele names, set to 1.

  • preserveLowresFlag (int, optional) – This option is similar to preserveUnknownFlag, but only applies to lowres alleles. If set to 1, PyPop will keep allele names that are shorter than the default allele name length, usually 4 digits long. But if preserveUnknownFlag is set, this option has no effect, because all unknown alleles are preserved.

  • alleleDesignator (str, optional) – the designator used to indicate a locus name (default *)

  • logFile (str, optional) – log file

  • untypedAllele (str, optional) – defaults to ****

  • unsequencedSite (str, optional) – defaults to #

  • sequenceFileSuffix (str, optional) – Suffix for the file names used to find the sequences for each allele (e.g., if the file for locus A is A_prot.msf, keep the default of _prot). For nucleotide sequence files, this would be set to _nuc.

  • filename (str, optional) – Currently not used

  • numDigits (int, optional) – Number of digits used for HLA data (default 4)

  • verboseFlag (int, optional) – Verbose output (default is on, i.e. 1)

  • debug (int, optional) – Enable debugging (default is off, i.e. 0)

  • sequenceFilterMethod (str, optional) – method for matching alleles to sequences; defaults to strict, can also be greedy
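
The effect of preserveAmbiguousFlag on the example above can be illustrated with a small stand-alone function. This is a simplified sketch, not PyPop's implementation: the helper name resolve_ambiguity is hypothetical, and keeping the first candidate when the flag is 0 is an assumption chosen to reproduce the documented example.

```python
def resolve_ambiguity(allele_name, num_digits=4, preserve_ambiguous=0):
    """Simplified illustration of preserveAmbiguousFlag (not PyPop's code)."""
    # Truncate every candidate in the slash-separated list to num_digits.
    candidates = [name[:num_digits] for name in allele_name.split("/")]
    if preserve_ambiguous:
        # Keep all candidates, joined back into one ambiguous "name".
        return "/".join(candidates)
    # Assumption: drop the ambiguity by keeping only the first candidate.
    return candidates[0]


first_only = resolve_ambiguity("010101/0102/010301")
all_kept = resolve_ambiguity("010101/0102/010301", preserve_ambiguous=1)
```

Here first_only corresponds to the default behaviour (0101) and all_kept to the preserved form (0101/0102/0103).
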

doFiltering(matrix=None)#

Do filtering on the provided matrix.

Parameters:

matrix (StringMatrix) – matrix to be filtered

Returns:

returns processed matrix for further downstream processing

Return type:

StringMatrix

startFirstPass(locus)#

Start the first pass of filtering.

Parameters:

locus (str) – locus to start filtering

See also

Must be paired with a subsequent endFirstPass()

checkAlleleName(alleleName)#

Checks allele name against the database.

Parameters:

alleleName (str) – allele name

Returns:

The original allele truncated to the appropriate number of digits. If the allele can’t be found using any of the heuristics, it is returned as the untypedAllele (normally ****).

Return type:

str
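
The documented return contract (a truncated name when the allele is recognised, otherwise the untyped designator) might be sketched as follows. The heuristics PyPop actually applies are not described here, so the prefix match against a set of known names is only a placeholder assumption, and check_allele_name and known_alleles are hypothetical names.

```python
def check_allele_name(allele_name, known_alleles, num_digits=4,
                      untyped_allele="****"):
    """Sketch of the return contract only; the lookup rule is a placeholder."""
    truncated = allele_name[:num_digits]
    # Placeholder heuristic: accept the name if any known allele starts
    # with the truncated form.
    found = any(known.startswith(truncated) for known in known_alleles)
    return truncated if found else untyped_allele


known = {"010101", "0102"}
```

With this stand-in database, a full-length name such as 010101 comes back as 0101, while an unrecognised name such as 9999 comes back as ****.
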

addAllele(alleleName)#

Add allele to be filtered.

Parameters:

alleleName (str) – allele to be filtered

endFirstPass()#

End first pass of filtering.

See also

Must be paired with a previous startFirstPass()

startFiltering()#

Start the main filtering.

See also

Must be paired with a subsequent endFiltering()

filterAllele(alleleName)#

Filter a specified allele.

Parameters:

alleleName (str) – allele to filter

Returns:

return the translated allele

Return type:

dict

endFiltering()#

End filtering.

See also

Must be paired with a previous startFiltering()
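
The pairing notes above suggest a two-pass calling sequence per locus. The sketch below shows one plausible ordering, assuming the first pass collects alleles before the main pass translates them; RecordingFilter and run_two_pass are hypothetical stand-ins that do not import PyPop, and the stand-in filterAllele simply returns the name unchanged.

```python
class RecordingFilter:
    """Hypothetical stand-in that records the paired calls it receives."""

    def __init__(self):
        self.calls = []
        self.seen = []

    def startFirstPass(self, locus):
        self.calls.append("startFirstPass")

    def addAllele(self, alleleName):
        self.seen.append(alleleName)

    def endFirstPass(self):
        self.calls.append("endFirstPass")

    def startFiltering(self):
        self.calls.append("startFiltering")

    def filterAllele(self, alleleName):
        return alleleName  # pass-through, for illustration only

    def endFiltering(self):
        self.calls.append("endFiltering")


def run_two_pass(filt, locus, alleles):
    """Drive a filter through the first pass, then the main filtering pass."""
    filt.startFirstPass(locus)
    for allele in alleles:
        filt.addAllele(allele)
    filt.endFirstPass()

    filt.startFiltering()
    result = [filt.filterAllele(allele) for allele in alleles]
    filt.endFiltering()
    return result


demo_filter = RecordingFilter()
demo_result = run_two_pass(demo_filter, "A", ["0101", "0201"])
```

Each start call is matched by its end call, and every allele is registered in the first pass before any allele is filtered.
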

writeToLog(logstring='\n')#

Write a string to log.

Parameters:

logstring (str) – defaults to line feed

cleanup()#

Do any cleanups.

makeSeqDictionaries(matrix=None, locus=None)#

Make a sequence dictionary for a given locus.

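Parameters:
  • matrix – input matrix (defaults to None)

  • locus – locus to process (defaults to None)

Returns:

a tuple (polyseq, polyseqpos), where polyseq (dict) is keyed on locus*allele and holds all allele sequences, containing ONLY the polymorphic positions, and polyseqpos (dict) is keyed on locus and holds the positions of the polymorphic residues found in polyseq.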

Return type:

tuple

Raises:

RuntimeError – If the alignment length could not be found in the MSF header.
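
The shape of the two returned dictionaries can be illustrated with a small stand-alone function that works on already-aligned sequences. This is a sketch under assumptions: it does not parse MSF files, the function name and the aligned input mapping are hypothetical, and positions are reported 0-based, which may differ from PyPop's convention.

```python
def make_seq_dictionaries(locus, aligned):
    """aligned maps allele name -> aligned sequence (all of equal length)."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must share one alignment length")
    length = lengths.pop()

    # A position is polymorphic if more than one residue occurs there.
    positions = [
        i for i in range(length)
        if len({seq[i] for seq in aligned.values()}) > 1
    ]

    # Keep only the polymorphic residues, keyed on locus*allele.
    polyseq = {
        f"{locus}*{allele}": "".join(seq[i] for i in positions)
        for allele, seq in aligned.items()
    }
    # Positions of those residues, keyed on locus.
    polyseqpos = {locus: positions}
    return polyseq, polyseqpos


polyseq, polyseqpos = make_seq_dictionaries(
    "A", {"0101": "MAVT", "0201": "MGVS"}
)
```

In this toy alignment only the second and fourth residues differ, so each allele is reduced to a two-residue string.
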

translateMatrix(matrix=None)#

Translate the whole matrix (all loci).

Parameters:

matrix (StringMatrix) – matrix to translate

Returns:

new instance with sequence data in columns

Return type:

StringMatrix

class BinningFilter(customBinningDict=None, logFile=None, untypedAllele='****', filename=None, binningDigits=4, debug=0)#

Filters original data into “bins”.

This can be done through either digits (for HLA alleles) or custom rules defined in a file for each locus.

Parameters:
  • customBinningDict (dict, optional) – a custom binning dictionary keyed by locus; each entry consists of a series of lines, each line containing a ruleset specifying which alleles belong in a given bin

  • logFile (str, optional) – output log file, must be set

  • untypedAllele (str, optional) – defaults to ****

  • filename (str, optional) – filename (unused), defaults to None

  • binningDigits (int, optional) – defaults to 4

  • debug (int, optional) – enable debugging (defaults to off, i.e. 0)

doDigitBinning(matrix=None)#

Do the digit binning on specified matrix.

Note

Digit binning is done only if binningDigits is set.

Parameters:

matrix (StringMatrix) – matrix to modify

Returns:

the modified matrix

Return type:

StringMatrix
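
Digit binning can be pictured as truncating each allele name to binningDigits characters. The sketch below operates on a plain list instead of a StringMatrix; the function name digit_bin is hypothetical, and leaving the untyped designator untouched is an assumption.

```python
def digit_bin(alleles, binning_digits=4, untyped_allele="****"):
    """Truncate each allele name to binning_digits (illustrative only)."""
    binned = []
    for allele in alleles:
        if allele == untyped_allele:
            # Assumption: untyped entries are passed through unchanged.
            binned.append(allele)
        else:
            binned.append(allele[:binning_digits])
    return binned


two_digit = digit_bin(["010101", "0102", "****"], binning_digits=2)
four_digit = digit_bin(["010101", "0102", "****"])
```

With two digits, 010101 and 0102 fall into the same bin (01); with the default of four digits they stay distinct.
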

doCustomBinning(matrix=None)#

Do the custom binning on specified matrix.

Note

Custom binning is done only if customBinningDict is set.

Parameters:

matrix (StringMatrix) – matrix to modify

Returns:

the modified matrix

Return type:

StringMatrix

lookupCustomBinning(testAllele, locus)#

Apply custom binning rules to an allele and locus pair.

Parameters:
  • testAllele (str) – allele to check

  • locus (str) – locus to check

Returns:

binned (or not) allele

Return type:

str
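
A lookup of this kind might be sketched as follows. The rule syntax used by PyPop is not specified here, so this example assumes a purely hypothetical format in which each line of a locus entry is a slash-separated list of alleles and the whole line serves as the bin label; the function name and the extra dictionary argument are also assumptions.

```python
def lookup_custom_binning(test_allele, locus, custom_binning_dict):
    """Return the bin for test_allele, or the allele itself if unbinned."""
    rules = custom_binning_dict.get(locus, "")
    for line in rules.splitlines():
        rule = line.strip()
        if not rule:
            continue  # skip blank lines
        # Hypothetical rule format: "allele1/allele2/...".
        if test_allele in rule.split("/"):
            return rule
    return test_allele


binning = {"A": "0101/0102\n0201/0205"}
```

An allele that matches no rule, or whose locus has no entry, is returned unchanged, which corresponds to the "binned (or not)" return value described above.
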

class AlleleCountAnthonyNolanFilter(lumpThreshold=None, **kw)#

Bases: AnthonyNolanFilter

Inheritance diagram of PyPop.Filter.AlleleCountAnthonyNolanFilter

Filters data with an allele count less than a threshold.

Parameters:

lumpThreshold (int) – alleles with a count below this threshold are lumped together

Base class parameters.

Parameters:
  • directoryName (str) – directory in which the Anthony Nolan allele data is located

  • remoteMSF (str) – Specifies the version (tag) of the remote msf directory in the IMGT-HLA GitHub repo. If present, the remote MSF files for the specified version will be downloaded on-demand, and cached for later reuse

  • alleleFileFormat (str, optional) – file format, can be txt or msf (default). Use of msf files is required in order to translate allele codes into polymorphic sequence data.

  • preserveAmbiguousFlag (int, optional) – If set to 0 (default), ambiguity is removed (e.g. 010101/0102/010301 is truncated to 0101). To preserve the ambiguity, set the option to 1 (for this example, this results in a filtered allele “name” of 0101/0102/0103).

  • preserveUnknownFlag (int, optional) – If set to 0 (default), unknown alleles are replaced with the untypedAllele designator. To keep unrecognized allele names, set to 1.

  • preserveLowresFlag (int, optional) – This option is similar to preserveUnknownFlag, but only applies to lowres alleles. If set to 1, PyPop will keep allele names that are shorter than the default allele name length, usually 4 digits long. But if preserveUnknownFlag is set, this option has no effect, because all unknown alleles are preserved.

  • alleleDesignator (str, optional) – the designator used to indicate a locus name (default *)

  • logFile (str, optional) – log file

  • untypedAllele (str, optional) – defaults to ****

  • unsequencedSite (str, optional) – defaults to #

  • sequenceFileSuffix (str, optional) – Suffix for the file names used to find the sequences for each allele (e.g., if the file for locus A is A_prot.msf, keep the default of _prot). For nucleotide sequence files, this would be set to _nuc.

  • filename (str, optional) – Currently not used

  • numDigits (int, optional) – Number of digits used for HLA data (default 4)

  • verboseFlag (int, optional) – Verbose output (default is on, i.e. 1)

  • debug (int, optional) – Enable debugging (default is off, i.e. 0)

  • sequenceFilterMethod (str, optional) – method for matching alleles to sequences; defaults to strict, can also be greedy

endFirstPass()#

End first pass and then lump alleles.

First runs the regular AnthonyNolanFilter processing, then changes every allele with a count < lumpThreshold to lump.
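
The lumping step can be sketched on a plain list of allele names. This is an illustration under assumptions, not PyPop's code: the function name is hypothetical, counts are taken over the list passed in, and the replacement label defaults to the string lump as suggested by the description above.

```python
from collections import Counter


def lump_rare_alleles(alleles, lump_threshold, lump_label="lump"):
    """Replace alleles whose count is below lump_threshold with lump_label."""
    counts = Counter(alleles)
    return [
        allele if counts[allele] >= lump_threshold else lump_label
        for allele in alleles
    ]


lumped = lump_rare_alleles(["0101", "0101", "0201", "0301", "0101"], 2)
```

Here 0101 occurs three times and is kept, while 0201 and 0301 each occur once and are both relabelled.
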