PyPop.Filter#
Filters for pre-filtering of data files before analysis.
This module includes filters that modify or otherwise transform the input data before it is passed to PyPop analysis.
Exceptions#
SubclassError | Customized exception if a subclass doesn't implement required methods.
Classes#
Filter | Abstract base class for all Filters.
PassThroughFilter | A filter that doesn't change input data.
AnthonyNolanFilter | Filters data via Anthony Nolan's allele call data.
BinningFilter | Filters original data into "bins".
AlleleCountAnthonyNolanFilter | Filters data with an allele count less than a threshold.
Module Contents#
- exception SubclassError#
Bases:
Exception
Customized exception if a subclass doesn’t implement required methods.
- class Filter#
Bases:
abc.ABC
Abstract base class for all Filters.
- abstractmethod doFiltering(matrix=None)#
- abstractmethod startFirstPass(locus)#
- abstractmethod checkAlleleName(alleleName)#
- abstractmethod addAllele(alleleName)#
- abstractmethod endFirstPass()#
- abstractmethod startFiltering()#
- abstractmethod filterAllele(alleleName)#
- abstractmethod endFiltering()#
- abstractmethod writeToLog(logstring=None)#
- abstractmethod cleanup()#
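The abstract-method pattern used by this base class can be illustrated with a minimal sketch. The classes below (MiniFilter, EchoFilter, IncompleteFilter) and the helper can_instantiate are hypothetical stand-ins written for this example; they are not part of PyPop and only borrow two of the method names listed above.

```python
import abc


class MiniFilter(abc.ABC):
    """Hypothetical stand-in for an abstract filter; not PyPop's class."""

    @abc.abstractmethod
    def filterAllele(self, alleleName):
        """Return the filtered form of a single allele name."""

    @abc.abstractmethod
    def cleanup(self):
        """Release any resources held by the filter."""


class EchoFilter(MiniFilter):
    """Implements every abstract method, so it can be instantiated."""

    def filterAllele(self, alleleName):
        return alleleName  # pass the name through unchanged

    def cleanup(self):
        pass


class IncompleteFilter(MiniFilter):
    """Omits cleanup(), so Python refuses to instantiate it."""

    def filterAllele(self, alleleName):
        return alleleName


def can_instantiate(cls):
    """Report whether cls() succeeds or is blocked by abstract methods."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

With the standard abc machinery, a subclass that leaves any abstract method unimplemented fails at instantiation time with a TypeError, which is what can_instantiate detects.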
- class PassThroughFilter#
Bases:
Filter
A filter that doesn’t change input data.
- doFiltering(matrix=None)#
- startFirstPass(locus)#
- checkAlleleName(alleleName)#
- addAllele(alleleName)#
- endFirstPass()#
- startFiltering()#
- filterAllele(alleleName)#
- endFiltering()#
- writeToLog(logstring=None)#
- cleanup()#
- class AnthonyNolanFilter(directoryName=None, remoteMSF=None, alleleFileFormat='msf', preserveAmbiguousFlag=0, preserveUnknownFlag=0, preserveLowresFlag=0, alleleDesignator='*', logFile=None, untypedAllele='****', unsequencedSite='#', sequenceFileSuffix='_prot', filename=None, numDigits=4, verboseFlag=1, debug=0, sequenceFilterMethod='strict')#
Bases:
Filter
Filters data via Anthony Nolan’s allele call data.
Allele call data files can be of either txt or msf formats.
txt files available at http://www.anthonynolan.com
msf files available at ftp://ftp.ebi.ac.uk/pub/databases/imgt/mhc/hla/
Base class parameters.
- Parameters:
directoryName (str) – directory where the Anthony Nolan allele data is located
remoteMSF (str) – specifies the version (tag) of the remote msf directory in the IMGT-HLA GitHub repo. If present, the remote MSF files for the specified version will be downloaded on demand and cached for later reuse.
alleleFileFormat (str, optional) – file format, can be txt or msf (default). Use of msf files is required in order to translate allele codes into polymorphic sequence data.
preserveAmbiguousFlag (int, optional) – If set to 0 (default), ambiguity is removed (e.g. 010101/0102/010301 is truncated to 0101). To preserve the ambiguity, set the option to 1 (for this example, the result is a filtered allele "name" of 0101/0102/0103).
preserveUnknownFlag (int, optional) – If set to 0 (default), unknown alleles are replaced with the untypedAllele designator. To keep unrecognized allele names, set to 1.
preserveLowresFlag (int, optional) – This option is similar to preserveUnknownFlag, but only applies to lowres alleles. If set to 1, PyPop will keep allele names that are shorter than the default allele name length, usually 4 digits long. If preserveUnknownFlag is set, this option has no effect, because all unknown alleles are preserved.
alleleDesignator (str, optional) – the designator used to indicate a locus name (default *)
logFile (str, optional) – log file
untypedAllele (str, optional) – defaults to ****
unsequencedSite (str, optional) – defaults to #
sequenceFileSuffix (str, optional) – suffix for file names used for finding the sequences of each allele (e.g., if the file for locus A is A_prot.msf, then keep the default of _prot). For nucleotide sequence files, this would be set to _nuc.
filename (str, optional) – currently not used
numDigits (int, optional) – number of digits used for HLA data (default 4)
verboseFlag (int, optional) – verbose output (default is on, i.e. 1)
debug (int, optional) – enable debugging (default is off, i.e. 0)
sequenceFilterMethod (str, optional) – method for matching alleles to sequences, defaults to strict, can also be greedy
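The effect of preserveAmbiguousFlag on the documented example (010101/0102/010301) can be sketched as follows. The function simplify_allele is a hypothetical helper written for this illustration, not PyPop code; it assumes that truncation simply keeps the first numDigits characters of each candidate and that, without the flag, the first candidate is the one retained. The real filter may resolve ambiguity differently.

```python
def simplify_allele(name, num_digits=4, preserve_ambiguous=False):
    """Hypothetical helper mirroring the documented example; not PyPop code."""
    # Split an ambiguous call such as "010101/0102/010301" into its parts
    # and keep only the first num_digits characters of each part.
    parts = [part[:num_digits] for part in name.split("/")]
    if preserve_ambiguous:
        # Keep every candidate, each truncated to num_digits.
        return "/".join(parts)
    # Assumption: without the flag, only the first candidate is kept.
    return parts[0]
```

Under these assumptions, the ambiguous call from the parameter description yields 0101 by default and 0101/0102/0103 when ambiguity is preserved.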
- doFiltering(matrix=None)#
Do filtering on the provided matrix.
- Parameters:
matrix (StringMatrix) – matrix to be filtered
- Returns:
returns processed matrix for further downstream processing
- Return type:
- startFirstPass(locus)#
Start the first pass of filtering.
- Parameters:
locus (str) – locus to start filtering
See also
Must be paired with a subsequent endFirstPass()
- checkAlleleName(alleleName)#
Checks allele name against the database.
- addAllele(alleleName)#
Add allele to be filtered.
- Parameters:
alleleName (str) – name of the allele to add for filtering
- endFirstPass()#
End first pass of filtering.
See also
Must be paired with a previous startFirstPass()
- startFiltering()#
Start the main filtering.
See also
Must be paired with a subsequent endFiltering()
- filterAllele(alleleName)#
Filter a specified allele.
- endFiltering()#
End filtering.
See also
Must be paired with a previous startFiltering()
- writeToLog(logstring='\n')#
Write a string to log.
- Parameters:
logstring (str) – defaults to line feed
- cleanup()#
Do any cleanups.
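The pairing requirements noted above (startFirstPass() with endFirstPass(), startFiltering() with endFiltering()) suggest a two-pass calling sequence. The sketch below shows one plausible ordering using a hypothetical RecordingFilter that only logs the calls it receives. It assumes, without confirmation from the source, that the first pass precedes the filtering pass and that addAllele() is called between startFirstPass() and endFirstPass(). Neither RecordingFilter nor run_two_pass exists in PyPop.

```python
class RecordingFilter:
    """Hypothetical duck-typed filter that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def startFirstPass(self, locus):
        self.calls.append(("startFirstPass", locus))

    def addAllele(self, alleleName):
        self.calls.append(("addAllele", alleleName))

    def endFirstPass(self):
        self.calls.append(("endFirstPass",))

    def startFiltering(self):
        self.calls.append(("startFiltering",))

    def filterAllele(self, alleleName):
        self.calls.append(("filterAllele", alleleName))
        return alleleName  # this toy filter leaves names unchanged

    def endFiltering(self):
        self.calls.append(("endFiltering",))


def run_two_pass(filt, locus, alleles):
    """Drive one locus through both passes, keeping each start/end paired."""
    # First pass: register every allele seen at this locus.
    filt.startFirstPass(locus)
    for allele in alleles:
        filt.addAllele(allele)
    filt.endFirstPass()

    # Second pass: ask the filter for the filtered form of each allele.
    filt.startFiltering()
    filtered = [filt.filterAllele(allele) for allele in alleles]
    filt.endFiltering()
    return filtered
```

Calling run_two_pass(RecordingFilter(), "A", ["0101", "0201"]) returns the alleles unchanged and leaves the call log in the order shown in the function body.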
- makeSeqDictionaries(matrix=None, locus=None)#
Make a sequence dictionary for a given locus.
- Parameters:
matrix (StringMatrix) – matrix to use.
locus (str) – locus to use.
- Returns:
polyseq (dict): Keyed on locus*allele of all allele sequences, containing ONLY the polymorphic positions.
polyseqpos (dict): Keyed on locus of the positions of the polymorphic residues which you find in polyseq.
- Return type:
- Raises:
RuntimeError – If the alignment length could not be found in the MSF header.
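The relationship between the two returned dictionaries can be sketched on a toy alignment. The function polymorphic_columns below is a hypothetical illustration, not the PyPop implementation: it assumes the aligned sequences are plain strings of equal length, stores the polymorphic residues as a string, and reports positions as 0-based indices. The real method reads MSF files and may use different value types or numbering.

```python
def polymorphic_columns(locus, aligned):
    """Hypothetical sketch; aligned maps allele name -> aligned sequence."""
    if not aligned:
        return {}, {locus: []}

    sequences = list(aligned.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    # A column is polymorphic when the alleles do not all share one residue.
    positions = [
        i for i in range(length)
        if len({seq[i] for seq in sequences}) > 1
    ]

    # Keep only the polymorphic residues, keyed on "locus*allele".
    polyseq = {
        f"{locus}*{allele}": "".join(seq[i] for i in positions)
        for allele, seq in aligned.items()
    }
    # Positions (0-based here, by assumption) keyed on the locus name.
    polyseqpos = {locus: positions}
    return polyseq, polyseqpos
```

For the toy input {"0101": "MAVT", "0201": "MGVS"} at locus A, only the second and fourth columns differ, so each allele is reduced to two residues.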
- translateMatrix(matrix=None)#
Translate the whole matrix (all loci).
- Parameters:
matrix (StringMatrix) – matrix to translate
- Returns:
new instance with sequence data in columns
- Return type:
- class BinningFilter(customBinningDict=None, logFile=None, untypedAllele='****', filename=None, binningDigits=4, debug=0)#
Filters original data into “bins”.
This can be done through either digits (for HLA alleles) or custom rules defined in a file for each locus.
- Parameters:
customBinningDict (dict, optional) – a custom binning dictionary keyed by locus; each entry consists of a series of lines, each containing a rule set for which alleles belong in a given bin
logFile (str, optional) – output log file, must be set
untypedAllele (str, optional) – defaults to ****
filename (str, optional) – filename (unused), defaults to None
binningDigits (int, optional) – defaults to 4
debug (int, optional) – enable debugging (defaults to off, i.e. 0)
- doDigitBinning(matrix=None)#
Do the digit binning on specified matrix.
Note
Digit binning is done only if binningDigits is set.
- Parameters:
matrix (StringMatrix) – matrix to modify
- Returns:
the modified matrix
- Return type:
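As a rough illustration of digit binning, the sketch below truncates each allele name to a fixed number of digits. The function bin_by_digits is hypothetical and operates on a plain list rather than a StringMatrix. That binning amounts to keeping the first binningDigits characters, and that untyped alleles are left untouched, are both assumptions made for this example rather than documented behavior.

```python
def bin_by_digits(alleles, binning_digits=4, untyped_allele="****"):
    """Hypothetical sketch of digit binning; not PyPop's implementation."""
    binned = []
    for allele in alleles:
        if allele == untyped_allele:
            # Assumption: untyped calls are passed through unchanged.
            binned.append(allele)
        else:
            # Assumption: a bin is the first binning_digits characters.
            binned.append(allele[:binning_digits])
    return binned
```

Under these assumptions, 010101 and 010102 fall into the same 4-digit bin, 0101.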
- doCustomBinning(matrix=None)#
Do the custom binning on specified matrix.
Note
Custom binning is done only if customBinningDict is set.
- Parameters:
matrix (StringMatrix) – matrix to modify
- Returns:
the modified matrix
- Return type:
- class AlleleCountAnthonyNolanFilter(lumpThreshold=None, **kw)#
Bases:
AnthonyNolanFilter
Filters data with an allele count less than a threshold.
- Parameters:
lumpThreshold (int) – set threshold
Base class parameters.
- Parameters:
directoryName (str) – directory where the Anthony Nolan allele data is located
remoteMSF (str) – specifies the version (tag) of the remote msf directory in the IMGT-HLA GitHub repo. If present, the remote MSF files for the specified version will be downloaded on demand and cached for later reuse.
alleleFileFormat (str, optional) – file format, can be txt or msf (default). Use of msf files is required in order to translate allele codes into polymorphic sequence data.
preserveAmbiguousFlag (int, optional) – If set to 0 (default), ambiguity is removed (e.g. 010101/0102/010301 is truncated to 0101). To preserve the ambiguity, set the option to 1 (for this example, the result is a filtered allele "name" of 0101/0102/0103).
preserveUnknownFlag (int, optional) – If set to 0 (default), unknown alleles are replaced with the untypedAllele designator. To keep unrecognized allele names, set to 1.
preserveLowresFlag (int, optional) – This option is similar to preserveUnknownFlag, but only applies to lowres alleles. If set to 1, PyPop will keep allele names that are shorter than the default allele name length, usually 4 digits long. If preserveUnknownFlag is set, this option has no effect, because all unknown alleles are preserved.
alleleDesignator (str, optional) – the designator used to indicate a locus name (default *)
logFile (str, optional) – log file
untypedAllele (str, optional) – defaults to ****
unsequencedSite (str, optional) – defaults to #
sequenceFileSuffix (str, optional) – suffix for file names used for finding the sequences of each allele (e.g., if the file for locus A is A_prot.msf, then keep the default of _prot). For nucleotide sequence files, this would be set to _nuc.
filename (str, optional) – currently not used
numDigits (int, optional) – number of digits used for HLA data (default 4)
verboseFlag (int, optional) – verbose output (default is on, i.e. 1)
debug (int, optional) – enable debugging (default is off, i.e. 0)
sequenceFilterMethod (str, optional) – method for matching alleles to sequences, defaults to strict, can also be greedy
- endFirstPass()#
End first pass and then lump alleles.
First process regular AnthonyNolanFilter, then modify all alleles with a count < lumpThreshold to lump.
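The lumping step can be sketched in isolation. The function lump_rare_alleles below is a hypothetical illustration that works on a plain list of allele names and does not reproduce the base-class filtering that precedes lumping. It assumes the replacement name is the literal string lump, as the description suggests.

```python
from collections import Counter


def lump_rare_alleles(alleles, lump_threshold, lump_name="lump"):
    """Hypothetical sketch: relabel alleles seen fewer than lump_threshold times."""
    # First pass: count how often each allele occurs.
    counts = Counter(alleles)
    # Second pass: keep common alleles, replace rare ones with lump_name.
    return [
        allele if counts[allele] >= lump_threshold else lump_name
        for allele in alleles
    ]
```

With a threshold of 2, an allele observed once is relabelled, while an allele observed twice or more keeps its name.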