snp_haplotyper package
Submodules
snp_haplotyper.autosomal_dominant_logic module
- snp_haplotyper.autosomal_dominant_logic.autosomal_dominant_analysis(df, affected_partner, unaffected_partner, reference, reference_status, reference_relationship)
Identifies any SNP site which could be used to inform a decision regarding inheriting an autosomal dominant condition (“informative” SNPs) and categorises the site as indicating “high_risk” or “low_risk” of inheriting the condition if that site is heterozygous (AB) in the embryo. The full logic behind the function is described in the documentation. This function asks the question: for the combination of haplotypes present in the reference trio (reference, unaffected_partner and affected_partner), what information would this SNP provide if it was AB in the embryo? Does the SNP indicate “high_risk”, “low_risk”, or is it “uninformative”?
- Parameters
df (dataframe) – A dataframe containing SNP input data for both partners, reference, & embryos
affected_partner (string) – Column name in dataframe referring to affected_partner’s data
unaffected_partner (string) – Column name in dataframe referring to unaffected_partner’s data
reference (string) – Column name in dataframe referring to reference’s data
reference_status (string) – “affected” or “unaffected”
reference_relationship (string) – “grandparent” or “child”
- Returns
Dataframe with a “snp_risk_category” column added, used to categorise the SNPs as “high_risk”, “low_risk”, and “uninformative”
- Return type
dataframe
snp_haplotyper.autosomal_recessive_logic module
- snp_haplotyper.autosomal_recessive_logic.autosomal_recessive_analysis(df, male_partner, female_partner, reference, reference_status, consanguineous)
Identifies any “informative” probes: sites where, if the embryo was heterozygous, we could identify who an allele was inherited from and whether it is shared by the affected/unaffected reference. For example, if the male partner is AA, the female partner is AB, and the B allele is also in the affected reference, then an AB embryo must have inherited a high risk allele from the female partner. The full logic behind the function is described in the documentation. This function asks the question: for the combination of haplotypes present in the reference trio (reference, male_partner and female_partner), what information would this SNP provide if it was AB in the embryo? Does the SNP indicate “high_risk”, “low_risk”, or is it “uninformative”? Additionally, for consanguineous cases (where “informative” SNPs can be limited) it also considers additional sites in the embryo (AA, BB) to boost the number of informative SNPs. NOTE: Consanguineous cases are not currently supported.
- Parameters
df (dataframe) – A dataframe containing SNP input data for both partners, reference, & embryos
male_partner (string) – Column name in dataframe referring to male_partner’s data
female_partner (string) – Column name in dataframe referring to female_partner’s data
reference (string) – Column name in dataframe referring to reference’s data
reference_status (string) – “affected” or “unaffected”
reference_relationship (string) – “grandparent” or “child”
consanguineous (boolean) – Flag indicating whether parents are consanguineous
- Returns
Dataframe with a “snp_risk_category” column added, used to categorise the SNPs as “high_risk”, “low_risk”, and “uninformative”, and a “snp_inherited_from” column indicating which partner the risk is inherited from
- Return type
dataframe
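The worked example in the description (male partner AA, female partner AB, B allele in the affected reference) can be sketched for a single SNP as below. This is a minimal illustration and not the package's implementation: `classify_ar_site` is a hypothetical helper, it assumes the reference is an affected child of the couple, and it ignores consanguineous cases and unaffected references.

```python
def classify_ar_site(male, female, reference):
    """Classify one SNP for an AB embryo, assuming an AFFECTED child reference.

    Hypothetical helper for illustration only.
    Returns (snp_risk_category, snp_inherited_from).
    """
    genotypes = {"AA", "BB", "AB"}
    if not {male, female, reference} <= genotypes:
        return ("uninformative", None)
    # The allele can only be traced if exactly one partner is heterozygous.
    if (male == "AB") == (female == "AB"):
        return ("uninformative", None)
    het_partner = "male_partner" if male == "AB" else "female_partner"
    hom_genotype = female if male == "AB" else male
    if reference == "AB":
        # The affected reference carries the allele that only the heterozygous
        # partner can contribute, so an AB embryo shares that allele.
        return ("high_risk", het_partner)
    if reference == hom_genotype:
        # The affected reference received the other allele from the
        # heterozygous partner, so an AB embryo does not share it.
        return ("low_risk", het_partner)
    # Reference genotype is not consistent with the parents at this site.
    return ("uninformative", None)
```

With male partner “AA”, female partner “AB” and reference “AB”, the sketch returns “high_risk” inherited from the female partner, matching the example in the description.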
snp_haplotyper.exceptions module
- exception snp_haplotyper.exceptions.ArgumentInputError
Bases: snp_haplotyper.exceptions.Error
Raised when the input parameters do not make sense considering the underlying biology
- exception snp_haplotyper.exceptions.Error
Bases: Exception
Base class for other exceptions
- exception snp_haplotyper.exceptions.InvalidParameterSelectedError
Bases: snp_haplotyper.exceptions.Error
The config.py has a flag prohibiting the script from running with the selected parameters. This is to prevent the user from running the script for options which have not yet been validated.
snp_haplotyper.sample_sheet_reader module
snp_haplotyper.snp_haplotype module
- snp_haplotyper.snp_haplotype.add_embryo_sex_to_column_name(html_string, embryo_ids, embryo_sex)
Annotates any table containing embryo data with the sex of the embryos
- Parameters
html_string (string) – An HTML formatted table with embryo ID column names
embryo_ids (list) – A list of embryo IDs
embryo_sex (list) – A list of embryo sexes corresponding to the embryo IDs
- Returns
HTML formatted table with the column headings annotated with the embryo sex.
- Return type
String
- snp_haplotyper.snp_haplotype.add_rsid_column(df, affy_2_rs_ids_df)
Provides dbsnp rsIDs
Creates a new column in the dataframe, df, matching the probeset IDs to dbSNP rsIDs.
- Parameters
df (dataframe) – A dataframe with a “probeset_id” column
affy_2_rs_ids_df (dataframe) – A dataframe with columns “probeset_id” & “rsID” used to map between the 2 identifiers
- Returns
Original dataframe, df, with columns for “rsID” added next to the “probeset_id” column (these columns are now the 1st columns of the dataframe)
- Return type
dataframe
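A minimal sketch of this kind of ID mapping, assuming pandas and a left join so that probesets without a known rsID are kept. The function name `add_rsid_column_sketch` is hypothetical and the real function may handle unmapped probesets differently.

```python
import pandas as pd


def add_rsid_column_sketch(df, affy_2_rs_ids_df):
    """Attach rsIDs by probeset_id and move both ID columns to the front."""
    lookup = affy_2_rs_ids_df[["probeset_id", "rsID"]]
    # Left join: assumption that unmapped probesets are kept with a missing rsID.
    merged = df.merge(lookup, on="probeset_id", how="left")
    leading = ["probeset_id", "rsID"]
    remaining = [col for col in merged.columns if col not in leading]
    return merged[leading + remaining]
```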
- snp_haplotyper.snp_haplotype.annotate_distance_from_gene(df, chr, start, end)
Annotates the probeset based on the provided genomic co-ordinates
Creates a new column, “gene_distance”, in the dataframe, df, annotating the region the SNP is in. SNPs are allocated to “within_gene”, “0-1MB_from_start”, “1-2MB_from_start”, “0-1MB_from_end”, or “1-2MB_from_end”.
- Parameters
df (dataframe) – A dataframe with a “probeset_id” column and the feature’s genomic co-ordinates, “Position”
chr (string) – The chromosome the gene of interest is on
start (int) – The start coordinate of the gene (1-based)
end (int) – The end coordinate of the gene (1-based)
- Returns
Original dataframe, df, with “gene_distance” column added characterising the probeset in relation to the gene of interest
- Return type
dataframe
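The region labels can be illustrated with a per-position classifier. This is a sketch under stated assumptions: `classify_gene_distance` is a hypothetical helper, the gene boundaries are treated as inclusive, a SNP exactly 1 Mb away falls in the 0-1MB bin, and positions more than 2 Mb from the gene return `None`. The chromosome check is omitted.

```python
MB = 1_000_000


def classify_gene_distance(position, start, end):
    """Return the gene_distance label for one SNP position (1-based)."""
    if start <= position <= end:
        return "within_gene"
    if position < start:
        gap, side = start - position, "start"
    else:
        gap, side = position - end, "end"
    if gap <= MB:
        return f"0-1MB_from_{side}"
    if gap <= 2 * MB:
        return f"1-2MB_from_{side}"
    # Assumption: SNPs outside the 2 Mb flanking window get no label.
    return None
```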
- snp_haplotyper.snp_haplotype.annotate_snp_position(df)
For a dataframe with a “gene_distance” column this adds a “snp_position” column. This is useful for summarising the data in the column.
- Parameters
df (dataframe) – A dataframe with “gene_distance” column with category values in the range: “1-2MB_from_start”, “0-1MB_from_start”, “within_gene”, “0-1MB_from_end”, “1-2MB_from_end”
- Returns
Dataframe with new column “snp_position”, with the category values “upstream”, “within_gene”, and “downstream”.
- Return type
dataframe
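The collapse from five “gene_distance” categories to three “snp_position” categories amounts to a lookup, sketched here in plain Python. The names `SNP_POSITION` and `to_snp_position` are hypothetical; the mapping follows the definition of upstream (0-2MB from the gene start) and downstream (0-2MB from the gene end) given elsewhere in this reference.

```python
# Hypothetical lookup table from gene_distance to snp_position.
SNP_POSITION = {
    "1-2MB_from_start": "upstream",
    "0-1MB_from_start": "upstream",
    "within_gene": "within_gene",
    "0-1MB_from_end": "downstream",
    "1-2MB_from_end": "downstream",
}


def to_snp_position(gene_distance):
    """Map one gene_distance label to its snp_position label."""
    return SNP_POSITION[gene_distance]
```

On a dataframe, the same dictionary could be applied with `df["gene_distance"].map(SNP_POSITION)`.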
- snp_haplotyper.snp_haplotype.calculate_nocall_percentages(df)
Calculate the percentage of nocalls
Takes a dataframe produced from calculate_qc_metrics() and calculates the % of nocalls per sample.
- Parameters
df (dataframe) – A dataframe produced by calculate_qc_metrics()
- Returns
Dataframe summarising the % of NoCalls per sample, with the same format as the dataframe produced by calculate_qc_metrics(). Essentially provides a row for the % of NoCalls per sample.
- Return type
dataframe
- snp_haplotyper.snp_haplotype.calculate_qc_metrics(df, male_partner, female_partner, reference, embryo_ids)
Calculate QC metrics based on the number of NoCalls per sample (measure of DNA quality)
Calculate QC metrics based on the number of NoCalls per sample which can be used as a metric of DNA quality.
- Parameters
df (dataframe) – A dataframe with the SNP array data
male_partner (string) – Column name representing the data for the male partner
female_partner (string) – Column name representing the data for the female partner
reference (string) – Column name representing the data for the reference
embryo_ids (list) – List of column names representing the data for one or more embryo samples
- Returns
Dataframe summarising the number of NoCalls per sample
- Return type
dataframe
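A rough sketch of NoCall-based QC, assuming pandas and that genotype calls are stored as strings with “NoCall” marking a failed call. The helpers `count_nocalls` and `nocall_percentages` are hypothetical and return a Series per sample, whereas the real functions return summary dataframes whose exact layout is not described here.

```python
import pandas as pd


def count_nocalls(df, sample_columns):
    """Number of NoCall genotypes in each sample column."""
    return df[sample_columns].eq("NoCall").sum()


def nocall_percentages(df, sample_columns):
    """Percentage of NoCall genotypes in each sample column."""
    # The mean of a boolean column is the fraction of True values.
    return df[sample_columns].eq("NoCall").mean() * 100
```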
- snp_haplotyper.snp_haplotype.categorise_embryo_alleles(df, male_partner, female_partner, embryo_ids, embryo_sex, mode_of_inheritance, consanguineous)
For each embryo this function categorises their SNPs
Note that the usable/informative genotypes for each mode of inheritance are hardcoded into this function. These have been defined with the PGD team. Genotypes are defined as usable based on whether we can trace the inheritance from a parent AND whether the allele is shared by an affected/unaffected reference, NOT on whether the site matches the genotype required to perform an SNV analysis of that site. For example, in an AR condition an SNV analysis would be interested in homozygous sites, but these sites are not informative in this application: only heterozygous sites allow us to determine who an allele was inherited from and whether it is shared by the reference.
- Parameters
df (dataframe) – A results_df dataframe produce
male_partner (string) –
female_partner (string) –
embryo_ids (list) – List of embryo_ids matching the columns names in df
embryo_sex (list) – List of sexes in the same order as embryo_ids (required for x-linked cases)
mode_of_inheritance (string) – “autosomal_dominant”, “autosomal_recessive”, “x_linked”
consanguineous (boolean) – Boolean value indicating if the parents are consanguineous
- Returns
Dataframe with a new column for each embryo annotated with a risk_category.
- Return type
dataframe
- snp_haplotyper.snp_haplotype.detect_miscall_or_ado(male_partner_haplotype, female_partner_haplotype, embryo_haplotype)
QC check to identify miscalls or ADOs (allele dropouts)
Takes the haplotypes for the male partner, female partner and embryo and calculates whether they indicate a miscall or ADO (allele dropout) in the embryo for that SNP.
The definition of a miscall is any haplotype in the embryo which is inconsistent with the haplotypes of the parents, e.g. parents “AA”, “BB” and an embryo “AA”. This is due to technical error in the measurement. NOTE: the miscall could be in any one of the trio even though it is recorded under the embryo.
The definition of ADO (allele dropout) is used when there is a suspected biological origin for the mismatch in haplotypes, due to uniparental inheritance of the allele, e.g. parents AA, BB and an embryo AA, where the B allele has dropped out. NOTE: the ADO could have occurred in any of the trio even though it is recorded under the embryo. It is expected that the Genomic Scientist will look at the SNP plots and use their judgement as to whether allele dropout is observed.
- Parameters
male_partner_haplotype (string) – Either “AA”, “BB”, “AB”, or “NoCall”
female_partner_haplotype (string) – Either “AA”, “BB”, “AB”, or “NoCall”
embryo_haplotype (string) – Either “AA”, “BB”, “AB”, or “NoCall”
- Returns
‘call’, ‘miscall’, ‘ADO’, or an Error message
- Return type
string
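The consistency check can be sketched as a Mendelian test on the trio. The two definitions above use the same example genotypes, so the rule used here to separate the categories is an assumption: a homozygous embryo whose allele is present in at least one parent is labelled “ADO”, and any other inconsistency is labelled “miscall”. `check_trio_call` and the “not_assessed” label for NoCall inputs are hypothetical.

```python
def check_trio_call(male, female, embryo):
    """Return 'call', 'ADO', 'miscall' or 'not_assessed' for one SNP."""
    valid = {"AA", "BB", "AB"}
    if "NoCall" in (male, female, embryo):
        # Assumption: sites with a missing genotype are not assessed.
        return "not_assessed"
    if not {male, female, embryo} <= valid:
        raise ValueError("Genotypes must be 'AA', 'BB', 'AB' or 'NoCall'")
    # Every genotype obtainable from one paternal and one maternal allele.
    expected = {"".join(sorted(m + f)) for m in male for f in female}
    if embryo in expected:
        return "call"
    embryo_is_homozygous = embryo[0] == embryo[1]
    if embryo_is_homozygous and embryo[0] in male + female:
        # Consistent with the other parent's allele having dropped out.
        return "ADO"
    return "miscall"
```

For parents “AA” and “BB” with an embryo “AA”, the sketch returns “ADO”; for parents “AA” and “AA” with an embryo “AB”, it returns “miscall”.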
- snp_haplotyper.snp_haplotype.export_json_data_as_csv(input_json, output_csv)
Import a JSON file and save the data as a CSV
Imports a simple JSON file and exports it as a CSV file. Used to import test data from JSON files (informative_snp_validation.json, embryo_validation_data.json, launch.json) and export it as human-readable CSV. These CSVs can be shared with Genomic Scientists during the validation process.
- Parameters
input_json (string) – The path to a JSON file
output_csv (string) – The path and filename for the output csv file
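One way such a conversion might look using only the standard library, assuming the JSON file holds a list of flat objects (one per row). The structure of the validation JSON files is not described here, so this layout and the helper name `json_to_csv` are assumptions.

```python
import csv
import json


def json_to_csv(input_json, output_csv):
    """Write a JSON list of flat objects to a CSV file, one row per object."""
    with open(input_json) as handle:
        records = json.load(handle)
    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(output_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        # Keys missing from a record are written as empty cells.
        writer.writerows(records)
```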
- snp_haplotyper.snp_haplotype.filter_out_nocalls(df, male_partner, female_partner, reference)
Filters out no calls
If the male partner, female partner, or reference has “NoCall” for a probeset then this probeset should be filtered out.
- Parameters
df (dataframe) – A dataframe with the SNP array data
male_partner (string) – Column name representing the data for the male partner
female_partner (string) – Column name representing the data for the female partner
reference (string) – Column name representing the data for the reference
- Returns
Original dataframe, df, with any rows where the male partner, female partner or reference has a “NoCall” filtered out
- Return type
dataframe
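The filter described above can be sketched with a boolean mask in pandas. `drop_trio_nocalls` is a hypothetical name. Embryo columns are deliberately left out of the check, since only the male partner, female partner and reference are mentioned.

```python
import pandas as pd


def drop_trio_nocalls(df, male_partner, female_partner, reference):
    """Keep only rows where none of the trio has a NoCall genotype."""
    trio = [male_partner, female_partner, reference]
    has_nocall = df[trio].eq("NoCall").any(axis=1)
    return df[~has_nocall]
```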
- snp_haplotyper.snp_haplotype.header_to_dict(header_str)
Converts a string of header info into a dictionary
- Parameters
header_info (str) – A string of key=value pairs like “PRU=1234;Hospital No=1234;Biopsy No:111”, where the keys will be the titles of the fields in the header
- Returns
A dictionary of the header info with field titles as keys and values as values
- Return type
dict
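A minimal sketch of parsing such a header string, assuming fields are separated by “;” and each field uses “=” between key and value. `parse_header` is a hypothetical helper that raises an error for fields without “=”, and the real function may be more lenient.

```python
def parse_header(header_str):
    """Split 'key=value;key=value' into a dictionary of strings."""
    fields = {}
    for pair in header_str.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing separator
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"Expected key=value, got: {pair!r}")
        fields[key.strip()] = value.strip()
    return fields
```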
- snp_haplotyper.snp_haplotype.main(args=None)
- snp_haplotyper.snp_haplotype.produce_html_table(df, table_identifier, include_index=False, include_total=False)
HTML table for pandas dataframe
Converts a pandas dataframe into an HTML table ready for inclusion in an HTML report
- Parameters
df (dataframe) – A dataframe which requires rendering as HTML for inclusion in the HTML report
table_identifier (string) – Sets id attribute for the table in the HTML
- Returns
HTML formatted table with the provided table_identifier used to set the HTML table id attribute.
- Return type
String
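pandas can render a dataframe with a chosen id attribute directly, which is one plausible way to build such a table. The sketch below uses `DataFrame.to_html` with its `table_id` and `index` arguments. `to_report_table` is a hypothetical wrapper, and the behaviour of include_total is not covered because it is not described above.

```python
import pandas as pd


def to_report_table(df, table_identifier, include_index=False):
    """Render a dataframe as an HTML table with the given id attribute."""
    return df.to_html(table_id=table_identifier, index=include_index)
```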
- snp_haplotyper.snp_haplotype.snps_by_region(df, mode_of_inheritance)
Summarise the number of SNPs by regions around the gene of interest
Takes a results_df dataframe produced from either autosomal_dominant_analysis(), autosomal_recessive_analysis(), or x_linked_analysis() and counts the SNPs per “gene_distance”:
“1-2MB_from_start”, “0-1MB_from_start”, “within_gene”, “0-1MB_from_end”, “1-2MB_from_end”
and per “snp_risk_category”:
“low_risk”, “high_risk”
For autosomal dominant cases one “snp_risk_category” column is produced for where the embryo SNP is AB. For x-linked cases three columns are produced for where the embryo SNP is female_AB, male_AA, or male_BB. For autosomal recessive cases an “snp_inherited_from” column is also added to show which partner the SNP is inherited from.
- Parameters
df (dataframe) – A dataframe produced by either autosomal_dominant_analysis(), autosomal_recessive_analysis(), or x_linked_analysis()
- Returns
Dataframe summarising the SNPs per genome region with additional columns for each relevant haplotype in the embryo.
- Return type
dataframe
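For the autosomal dominant case, where there is a single “snp_risk_category” column, the counting step could look like the cross-tabulation below. This is a sketch only: `count_snps_by_region` is a hypothetical helper, and the x-linked and autosomal recessive layouts with their extra columns are not handled.

```python
import pandas as pd


def count_snps_by_region(df):
    """Count SNPs for each gene_distance / snp_risk_category combination."""
    # Combinations that never occur are filled with 0 by crosstab.
    return pd.crosstab(df["gene_distance"], df["snp_risk_category"])
```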
- snp_haplotyper.snp_haplotype.summarise_embryo_results(df, embryo_ids)
Summarise embryo results for each embryo in embryo_ids
- snp_haplotyper.snp_haplotype.summarise_snps_per_embryo_pretty(df, embryo_ids)
This function groups a results data frame by gene_distance and risk category, and then sums the number of SNPs in each category. It then adds a new column to the dataframe, “snp_position”, which is either “upstream”, “downstream”, or “within_gene”. Where upstream is 0-2MB from the start of the gene (5’ direction) and downstream is 0-2MB from the end of the gene in the 3’ direction.
- Parameters
df (dataframe) – A dataframe with “gene_distance” column with category values in the range: “1-2MB_from_start”, “0-1MB_from_start”, “within_gene”, “0-1MB_from_end”, “1-2MB_from_end”, and risk category columns for each embryo in embryo_ids, with category values in the range: “high_risk”, “low_risk”, “uninformative”, “NoCall”, “miscall”, “ADO”
embryo_ids (list) – A list of embryo columns in the dataframe to be summarised.
- snp_haplotyper.snp_haplotype.summarised_snps_by_region(df, mode_of_inheritance)
Summarises the SNPs per genome region for a given mode of inheritance.
snp_haplotyper.snp_plot module
- snp_haplotyper.snp_plot.filter_snps_by_catergory(df, lookup_category, embryo)
- snp_haplotyper.snp_plot.filter_snps_by_partner_sex(df, partner_sex='all')
- snp_haplotyper.snp_plot.filter_snps_by_region(df, required_region)
- snp_haplotyper.snp_plot.plot_results(df, summary_df, embryo_ids, embryo_sex, gene_start, gene_end, mode_of_inheritance)
Plots SNP data
For AD and XL produces a single plotly plot of the gene + 2MB flanking region with SNP information and summaries. For AR a faceted plot is produced, further splitting the info by the partner the SNP is inherited from.
- Parameters
df –
embryo_ids –
mode_of_inheritance –
Returns:
- snp_haplotyper.snp_plot.summarise_snps_per_embryo(df, embryo_ids, mode_of_inheritance)
Plots SNP data
For AD and XL produces a single plotly plot of the gene + 2MB flanking region with SNP information and summaries. For AR a faceted plot is produced, further splitting the info by the partner the SNP is inherited from.
- Parameters
df –
embryo_ids –
mode_of_inheritance –
Returns:
snp_haplotyper.x_linked_logic module
- snp_haplotyper.x_linked_logic.x_linked_analysis(df, carrier_female_partner, unaffected_male_partner, reference)
Identifies any SNP site which could be used to inform a decision regarding inheriting an X-linked condition (“informative” SNPs) and categorises the site as indicating “high_risk” or “low_risk” of inheriting the condition for a known haplotype in the embryo (see below for details). The full logic behind the function is described in the documentation. This function asks the question: for the combination of haplotypes present in the reference trio (reference, carrier_female_partner and unaffected_male_partner), what information would this SNP provide if the SNP was:
AB in a female embryo
AA in a male embryo
BB in a male embryo
i.e. does this SNP indicate “high_risk”, “low_risk”, or is it “uninformative” in regard to inheriting the X-linked condition?
- Parameters
df (dataframe) – A dataframe containing SNP input data for both partners, reference, & embryos
carrier_female_partner (string) – Column name in dataframe referring to carrier_female_partner’s data
unaffected_male_partner (string) – Column name in dataframe referring to unaffected_male_partner’s data
reference (string) – Column name in dataframe referring to reference’s data (always a child)
- Returns
Dataframe containing 3 new “snp_risk_category” columns, used to categorise the SNPs as “high_risk”, “low_risk”, and “uninformative” for the three different embryo categories: female_AB_snp_risk_category, male_AA_snp_risk_category, male_BB_snp_risk_category
- Return type
Dataframe