snp_haplotyper package

Submodules

snp_haplotyper.autosomal_dominant_logic module

snp_haplotyper.autosomal_dominant_logic.autosomal_dominant_analysis(df, affected_partner, unaffected_partner, reference, reference_status, reference_relationship)

Identifies any SNP site which could be used to inform a decision regarding inheriting an autosomal dominant condition (“informative” SNPs) and categorizes the site as indicating “high_risk” or “low_risk” of inheriting the condition if that site is heterozygous (AB) in the embryo.

TODO: change the Google Doc link describing the full logic behind the function to a Read the Docs link.

This function asks the question: for the combination of haplotypes present in the reference trio (reference, unaffected_partner and affected_partner), what information would this SNP provide if it was AB in the embryo? Does the SNP indicate “high_risk”, “low_risk”, or is it “uninformative”?

Parameters

df (dataframe) – A dataframe containing SNP input data for both partners, reference, & embryos
affected_partner (string) – Column name in dataframe referring to affected_partner’s data
unaffected_partner (string) – Column name in dataframe referring to unaffected_partner’s data
reference (string) – Column name in dataframe referring to reference’s data
reference_status (string) – “affected” or “unaffected”
reference_relationship (string) – “grandparent” or “child”

Returns

Dataframe with a “snp_risk_category” column added, used to categorise the SNPs as “high_risk”, “low_risk”, and “uninformative”

Return type

dataframe

snp_haplotyper.autosomal_recessive_logic module

snp_haplotyper.autosomal_recessive_logic.autosomal_recessive_analysis(df, male_partner, female_partner, reference, reference_status, consanguineous)

Identifies “informative” probes: sites where, if the embryo was heterozygous, we could identify which partner an allele was inherited from and whether it is shared by the affected/unaffected reference. For example, if the male partner is AA, the female partner is AB, and the B allele is also in the affected reference, then an AB embryo must have inherited a high-risk allele from the female partner. The full logic behind the function is described in the documentation.

This function asks the question: for the combination of haplotypes present in the reference trio (reference, male_partner and female_partner), what information would this SNP provide if it was AB in the embryo? Does the SNP indicate “high_risk”, “low_risk”, or is it “uninformative”?

Additionally, for consanguineous cases (where “informative” SNPs can be limited) it also considers additional sites in the embryo (AA, BB) to boost the number of informative SNPs. NOTE: Consanguineous cases are not currently supported.

Parameters

df (dataframe) – A dataframe containing SNP input data for both partners, reference, & embryos
male_partner (string) – Column name in dataframe referring to male_partner’s data
female_partner (string) – Column name in dataframe referring to female_partner’s data
reference (string) – Column name in dataframe referring to reference’s data
reference_status (string) – “affected” or “unaffected”
reference_relationship (string) – “grandparent” or “child”
consanguineous (boolean) – Flag indicating whether parents are consanguineous

Returns

Dataframe with a “snp_risk_category” column added, used to categorise the SNPs as “high_risk”, “low_risk”, and “uninformative”, and a “snp_inherited_from” column indicating which partner the risk is inherited from

Return type

dataframe
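
The worked example in the description (male partner AA, female partner AB, B allele present in the affected reference) can be sketched for a single SNP in plain Python. This is only an illustration under stated assumptions, not the package's implementation: the helper name classify_ar_snp is hypothetical, it handles an affected reference only, and it works on genotype strings rather than dataframe columns.

```python
CALLED = {"AA", "AB", "BB"}


def classify_ar_snp(male_gt, female_gt, reference_gt):
    """Classify one SNP for an AB embryo, assuming an affected reference.

    Hypothetical helper: returns (snp_risk_category, snp_inherited_from).
    """
    if not {male_gt, female_gt, reference_gt} <= CALLED:
        return ("uninformative", None)  # e.g. a NoCall in the trio

    homozygous = {"AA", "BB"}
    if male_gt == "AB" and female_gt in homozygous:
        het_partner, hom_gt = "male_partner", female_gt
    elif female_gt == "AB" and male_gt in homozygous:
        het_partner, hom_gt = "female_partner", male_gt
    else:
        # Both heterozygous or both homozygous: the origin cannot be traced.
        return ("uninformative", None)

    # An AB embryo must have received this allele from the heterozygous partner.
    traced_allele = "B" if hom_gt == "AA" else "A"
    if traced_allele in reference_gt:
        return ("high_risk", het_partner)
    return ("low_risk", het_partner)


print(classify_ar_snp("AA", "AB", "AB"))
print(classify_ar_snp("AA", "AB", "AA"))
print(classify_ar_snp("AB", "AB", "AB"))
```

The first call mirrors the example in the description; the other two show a site where the traced allele is absent from the reference and a site where both partners are heterozygous.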

snp_haplotyper.exceptions module

exception snp_haplotyper.exceptions.ArgumentInputError

Bases: snp_haplotyper.exceptions.Error

Raised when the input parameters do not make sense considering the underlying biology

exception snp_haplotyper.exceptions.Error

Bases: Exception

Base class for other exceptions

exception snp_haplotyper.exceptions.InvalidParameterSelectedError

Bases: snp_haplotyper.exceptions.Error

Raised when a flag in config.py prohibits the script from running with the selected parameters. This prevents the user from running the script with options which have not yet been validated.
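
Because both specific exceptions derive from Error, a caller can catch the base class to handle either one. The sketch below defines stand-in classes with the same names so that it runs without the package installed; run_analysis, safe_run and the validated argument are hypothetical and are not part of the package.

```python
class Error(Exception):
    """Stand-in for snp_haplotyper.exceptions.Error."""


class ArgumentInputError(Error):
    """Stand-in: inputs that do not make sense biologically."""


class InvalidParameterSelectedError(Error):
    """Stand-in: options blocked by a config flag."""


def run_analysis(mode_of_inheritance, validated):
    # Hypothetical guard imitating a config flag for unvalidated options.
    if not validated:
        raise InvalidParameterSelectedError(
            f"{mode_of_inheritance} has not been validated"
        )
    return "ok"


def safe_run(mode_of_inheritance, validated):
    try:
        return run_analysis(mode_of_inheritance, validated)
    except Error as exc:  # catches either subclass
        return f"failed: {exc}"


print(safe_run("autosomal_dominant", True))
print(safe_run("x_linked", False))
```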

snp_haplotyper.sample_sheet_reader module

snp_haplotyper.snp_haplotype module

snp_haplotyper.snp_haplotype.add_embryo_sex_to_column_name(html_string, embryo_ids, embryo_sex)

Annotates any table containing embryo data with the sex of the embryos

Parameters

html_string (string) – An HTML formatted table with embryo ID column names
embryo_ids (list) – A list of embryo IDs
embryo_sex (list) – A list of embryo sexes corresponding to the embryo IDs

Returns

HTML formatted table with the column headings annotated with the embryo sex.

Return type

String
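
One way such an annotation could work is a plain string replacement on the header cells. The sketch below is an assumption about the approach, not the package's code: the helper name annotate_embryo_headers and the “ID (sex)” heading format are hypothetical, and it assumes header cells are written as bare <th> elements.

```python
def annotate_embryo_headers(html_string, embryo_ids, embryo_sex):
    """Append each embryo's sex to its column heading (hypothetical format)."""
    for embryo_id, sex in zip(embryo_ids, embryo_sex):
        html_string = html_string.replace(
            f"<th>{embryo_id}</th>", f"<th>{embryo_id} ({sex})</th>"
        )
    return html_string


table = "<table><tr><th>probeset_id</th><th>emb1</th><th>emb2</th></tr></table>"
print(annotate_embryo_headers(table, ["emb1", "emb2"], ["female", "male"]))
```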

snp_haplotyper.snp_haplotype.add_rsid_column(df, affy_2_rs_ids_df)

Provides dbSNP rsIDs

New column created in the dataframe, df, matching the probeset IDs to dbSNP rsIDs.

Parameters

df (dataframe) – A dataframe with a “probeset_id” column
affy_2_rs_ids_df (dataframe) – A dataframe with columns “probeset_id” & “rsID” used to map between the 2 identifiers

Returns

Original dataframe, df, with an “rsID” column added next to the “probeset_id” column (these are now the first columns of the dataframe)

Return type

dataframe

snp_haplotyper.snp_haplotype.annotate_distance_from_gene(df, chr, start, end)

Annotates the probeset based on the provided genomic co-ordinates

New column created, “gene_distance”, in the dataframe, df, annotating the region the SNP is in. SNPs are allocated to “within_gene”, “0-1MB_from_start”, “1-2MB_from_start”, “0-1MB_from_end”, and “1-2MB_from_end”.

Parameters

df (dataframe) – A dataframe with a “probeset_id” column and the feature’s genomic co-ordinates, “Position”
chr (string) – The chromosome the gene of interest is on
start (int) – The start coordinate of the gene (1-based)
end (int) – The end coordinate of the gene (1-based)

Returns

Original dataframe, df, with “gene_distance” column added characterising the probeset in relation to the gene of interest

Return type

dataframe
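
The region labels can be illustrated with a small classifier for a single position. This is a sketch under assumptions: the helper name gene_distance_label is hypothetical, the exact treatment of boundary positions is a guess, and the “outside_range” label for SNPs more than 2 MB away is invented here because the description does not say how such SNPs are handled.

```python
MB = 1_000_000


def gene_distance_label(position, start, end):
    """Return the region label for one SNP position (1-based coordinates)."""
    if start <= position <= end:
        return "within_gene"
    if start - MB <= position < start:
        return "0-1MB_from_start"
    if start - 2 * MB <= position < start - MB:
        return "1-2MB_from_start"
    if end < position <= end + MB:
        return "0-1MB_from_end"
    if end + MB < position <= end + 2 * MB:
        return "1-2MB_from_end"
    return "outside_range"  # hypothetical label, not named in the description


gene_start, gene_end = 5_000_000, 5_100_000
for pos in (3_500_000, 4_500_000, 5_050_000, 5_600_000, 6_600_000):
    print(pos, gene_distance_label(pos, gene_start, gene_end))
```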

snp_haplotyper.snp_haplotype.annotate_snp_position(df)

For a dataframe with a “gene_distance” column this adds a “snp_position” column. This is useful for summarising data in the column.

Parameters: df (dataframe) – A dataframe with “gene_distance” column with category values in the range: “1-2MB_from_start”, “0-1MB_from_start”, “within_gene”, “0-1MB_from_end”, “1-2MB_from_end”
Returns: Dataframe with new column “snp_position”, with the category values “upstream”, “within_gene”, and “downstream”.
Return type: dataframe
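
The relationship between the two columns amounts to a lookup table. The sketch below assumes that both from_start bins count as “upstream” and both from_end bins count as “downstream”; the names SNP_POSITION and snp_position_label are hypothetical.

```python
# Assumed mapping from "gene_distance" categories to "snp_position" categories.
SNP_POSITION = {
    "1-2MB_from_start": "upstream",
    "0-1MB_from_start": "upstream",
    "within_gene": "within_gene",
    "0-1MB_from_end": "downstream",
    "1-2MB_from_end": "downstream",
}


def snp_position_label(gene_distance):
    """Collapse a gene_distance category into its snp_position category."""
    return SNP_POSITION[gene_distance]


for category in SNP_POSITION:
    print(category, "->", snp_position_label(category))
```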

snp_haplotyper.snp_haplotype.calculate_nocall_percentages(df)

Calculate the percentage of nocalls

Takes a dataframe produced from calculate_qc_metrics() and calculates the % of nocalls per sample.

Parameters: df (dataframe) – A dataframe produced by calculate_qc_metrics()
Returns: Dataframe summarising the % of NoCalls per sample, with the same format as the dataframe produced by calculate_qc_metrics(). Essentially provides a row for the % of NoCalls per sample
Return type: dataframe
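
The underlying arithmetic is the NoCall count divided by the total number of probes for a sample. A minimal sketch on plain lists follows; the helper name nocall_percentage and the sample data are hypothetical, and the real function operates on the dataframe from calculate_qc_metrics() instead.

```python
def nocall_percentage(genotypes):
    """Percentage of NoCall entries in one sample's list of genotype calls."""
    if not genotypes:
        return 0.0
    return 100 * genotypes.count("NoCall") / len(genotypes)


samples = {
    "male_partner": ["AA", "AB", "NoCall", "BB"],
    "embryo_1": ["NoCall", "NoCall", "AB", "AA"],
}
for name, calls in samples.items():
    print(name, nocall_percentage(calls))
```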

snp_haplotyper.snp_haplotype.calculate_qc_metrics(df, male_partner, female_partner, reference, embryo_ids)

Calculate QC metrics based on the number of NoCalls per sample (measure of DNA quality)

Calculate QC metrics based on the number of NoCalls per sample which can be used as a metric of DNA quality.

Parameters

df (dataframe) – A dataframe with the SNP array data
male_partner (string) – Column name representing the data for the male partner
female_partner (string) – Column name representing the data for the female partner
reference (string) – Column name representing the data for the reference
embryo_ids (list) – List of column names representing the data for 1 to n embryo samples

Returns

Dataframe summarising the number of NoCalls per sample

Return type

dataframe

snp_haplotyper.snp_haplotype.categorise_embryo_alleles(df, male_partner, female_partner, embryo_ids, embryo_sex, mode_of_inheritance, consanguineous)

For each embryo this function categorises their SNPs

Note that the usable/informative genotypes for each mode of inheritance are hardcoded into this function. These have been defined with the PGD team. Genotypes are defined as usable based on whether we can trace the inheritance from a parent AND whether the allele is shared by an affected/unaffected reference, NOT on whether the site matches the genotype required to perform an SNV analysis of that site. For example, in an AR condition an SNV analysis would be interested in homozygous sites, but these sites are not informative in this application: only heterozygous sites allow us to determine who an allele was inherited from and whether it is shared by the reference.

Parameters

df (dataframe) – A results_df dataframe produce
male_partner (string) –
female_partner (string) –
embryo_ids (list) – List of embryo_ids matching the columns names in df
embryo_sex (list) – List of sexes in the same order as embryo_ids (required for x-linked cases)
mode_of_inheritance (string) – “autosomal_dominant”, “autosomal_recessive”, “x_linked”
consanguineous (boolean) – Boolean value indicating if the parents are consanguineous

Returns

Dataframe with a new column for each embryo annotated with a risk_category.

Return type

dataframe

snp_haplotyper.snp_haplotype.detect_miscall_or_ado(male_partner_haplotype, female_partner_haplotype, embryo_haplotype)

QC check to identify miscalls or ADOs (allele dropouts)

Takes the haplotypes for the male partner, female partner and embryo and determines whether they indicate a miscall or ADO (allele dropout) in the embryo for that SNP.

A miscall is any haplotype in the embryo which is inconsistent with the haplotypes of the parents, e.g. parents “AA” and “BB” with an embryo “AA”. This is due to technical error in the measurement. NOTE: the miscall could be in any one of the trio even though it is recorded under the embryo.

ADO (allele dropout) is used when there is a suspected biological origin for the mismatch in haplotypes, due to uniparental inheritance of the allele, e.g. parents “AA” and “BB” with an embryo “AA”, where the B allele has dropped out. NOTE: the ADO could have occurred in any of the trio even though it is recorded under the embryo. It is expected that the Genomic Scientist will look at the SNP plots and use their judgement as to whether allele dropout is observed.

Parameters

male_partner_haplotype (string) – Either “AA”, “BB”, “AB”, or “NoCall”
female_partner_haplotype (string) – Either “AA”, “BB”, “AB”, or “NoCall”
embryo_haplotype (string) – Either “AA”, “BB”, “AB”, or “NoCall”

Returns

‘call’, ‘miscall’, ‘ADO’, or an Error message

Return type

string
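
The consistency check can be sketched by enumerating the genotypes an embryo could inherit from its parents. The description does not spell out how the “miscall” and “ADO” labels are separated, so the rule used here (a homozygous embryo at a site where AB was possible is labelled ADO, anything else inconsistent is a miscall) is an assumption, and the package's own rule may differ. The helper name check_embryo_call is hypothetical and NoCall inputs are deliberately rejected.

```python
from itertools import product

CALLED = {"AA", "AB", "BB"}


def check_embryo_call(male_gt, female_gt, embryo_gt):
    """Label an embryo genotype as 'call', 'ADO' or 'miscall' (assumed rule)."""
    if not {male_gt, female_gt, embryo_gt} <= CALLED:
        raise ValueError("this sketch only handles AA, AB and BB")
    # Every genotype obtainable by taking one allele from each parent.
    expected = {"".join(sorted(pair)) for pair in product(male_gt, female_gt)}
    if embryo_gt in expected:
        return "call"
    # Assumption: a homozygous embryo where AB was possible has lost one allele.
    if "AB" in expected and embryo_gt in {"AA", "BB"}:
        return "ADO"
    return "miscall"


print(check_embryo_call("AA", "BB", "AB"))
print(check_embryo_call("AA", "BB", "AA"))
print(check_embryo_call("AA", "AA", "AB"))
```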

snp_haplotyper.snp_haplotype.export_json_data_as_csv(input_json, output_csv)

Import a JSON file and save the data as a CSV

Imports a simple JSON file and exports it as a CSV file. Used to import test data from JSON files (informative_snp_validation.json, embryo_validation_data.json, launch.json) and export it as human readable CSV. These CSV can be shared with Genomic Scientists during the validation process.

Parameters

input_json (string) – The path to a JSON file
output_csv (string) – The path and filename for the output csv file

snp_haplotyper.snp_haplotype.filter_out_nocalls(df, male_partner, female_partner, reference)

Filters out no calls

If the male partner, female partner, or reference has “NoCall” for a probeset then this probeset should be filtered out.

Parameters

df (dataframe) – A dataframe with the SNP array data
male_partner (string) – Column name representing the data for the male partner
female_partner (string) – Column name representing the data for the female partner
reference (string) – Column name representing the data for the reference

Returns

Original dataframe, df, with any rows where the male partner, female partner or reference has a “NoCall” filtered out

Return type

dataframe
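
The filtering rule looks only at the three trio columns, so a NoCall in an embryo column does not remove a row. A minimal sketch on a list of dictionaries follows; the helper name drop_trio_nocalls, the column names and the probeset IDs are hypothetical, and the real function works on a dataframe.

```python
def drop_trio_nocalls(rows, male_partner, female_partner, reference):
    """Keep rows where none of the three trio columns is 'NoCall'."""
    trio = (male_partner, female_partner, reference)
    return [row for row in rows if all(row[col] != "NoCall" for col in trio)]


rows = [
    {"probeset_id": "AX-1", "male": "AA", "female": "AB", "ref": "AB", "emb1": "NoCall"},
    {"probeset_id": "AX-2", "male": "NoCall", "female": "AB", "ref": "AB", "emb1": "AB"},
    {"probeset_id": "AX-3", "male": "BB", "female": "AB", "ref": "NoCall", "emb1": "BB"},
]
kept = drop_trio_nocalls(rows, "male", "female", "ref")
print([row["probeset_id"] for row in kept])
```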

snp_haplotyper.snp_haplotype.header_to_dict(header_str)

Converts a string of header information into a dictionary

Parameters: header_str (str) – A string of key=value pairs like “PRU=1234;Hospital No=1234;Biopsy No:111”, where the keys will be the titles of the fields in the header
Returns: A dictionary of the header info with field titles as keys and values as values
Return type: dict
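
Parsing of this kind can be sketched by splitting on “;” and then on the first “=”. This is an assumption about the approach rather than the package's code: the helper name parse_header is hypothetical, and the sketch only handles “=” separated pairs.

```python
def parse_header(header_str):
    """Split 'key=value;key=value' text into a dict (hypothetical helper)."""
    fields = {}
    for item in header_str.split(";"):
        if not item.strip():
            continue  # ignore empty segments such as a trailing ';'
        key, _, value = item.partition("=")
        fields[key.strip()] = value.strip()
    return fields


print(parse_header("PRU=1234;Hospital No=1234"))
```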

snp_haplotyper.snp_haplotype.main(args=None)

snp_haplotyper.snp_haplotype.produce_html_table(df, table_identifier, include_index=False, include_total=False)

HTML table for pandas dataframe

Converts a pandas dataframe into an HTML table ready for inclusion in an HTML report

Parameters

df (dataframe) – A dataframe which requires rendering as HTML for inclusion in the HTML report
table_identifier (string) – Sets id attribute for the table in the HTML

Returns

HTML formatted table with the provided table_identifier used to set the HTML table id attribute.

Return type

String

snp_haplotyper.snp_haplotype.snps_by_region(df, mode_of_inheritance)

Summarise the number of SNPs by regions around the gene of interest

Takes a results_df dataframe produced from either autosomal_dominant_analysis(), autosomal_recessive_analysis(), or x_linked_analysis() and counts the SNPs per “gene_distance” (“1-2MB_from_start”, “0-1MB_from_start”, “within_gene”, “0-1MB_from_end”, “1-2MB_from_end”) and “snp_risk_category” (“low_risk”, “high_risk”).

For autosomal dominant cases one “snp_risk_category” column is produced for where the embryo SNP is AB. For x-linked cases three columns are produced for where the embryo SNP is female_AB, male_AA, or male_BB. For autosomal recessive cases an “snp_inherited_from” column is also added to show which partner the SNP is inherited from.

Parameters: df (dataframe) – A dataframe produced by either autosomal_dominant_analysis(), autosomal_recessive_analysis(), or x_linked_analysis()

Returns: Dataframe summarising the SNPs per genome region with additional columns for each relevant haplotype in the embryo.
Return type: dataframe
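
The counting step can be illustrated with collections.Counter over (gene_distance, snp_risk_category) pairs. This sketch covers only the single-column autosomal dominant shape and uses a list of dictionaries in place of a dataframe; the helper name count_snps_by_region and the example rows are hypothetical.

```python
from collections import Counter


def count_snps_by_region(rows):
    """Count SNPs per (gene_distance, snp_risk_category) pair."""
    return Counter(
        (row["gene_distance"], row["snp_risk_category"]) for row in rows
    )


rows = [
    {"gene_distance": "within_gene", "snp_risk_category": "high_risk"},
    {"gene_distance": "within_gene", "snp_risk_category": "high_risk"},
    {"gene_distance": "0-1MB_from_start", "snp_risk_category": "low_risk"},
]
counts = count_snps_by_region(rows)
for key, n in sorted(counts.items()):
    print(key, n)
```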

snp_haplotyper.snp_haplotype.summarise_embryo_results(df, embryo_ids): Summarise embryo results for each embryo in embryo_ids

snp_haplotyper.snp_haplotype.summarise_snps_per_embryo_pretty(df, embryo_ids)

This function groups a results data frame by gene_distance and risk category, and then sums the number of SNPs in each category. It then adds a new column to the dataframe, “snp_position”, which is either “upstream”, “downstream”, or “within_gene”. Where upstream is 0-2MB from the start of the gene (5’ direction) and downstream is 0-2MB from the end of the gene in the 3’ direction.

Parameters

df (dataframe) – A dataframe with “gene_distance” column with category values in the range: “1-2MB_from_start”, “0-1MB_from_start”, “within_gene”, “0-1MB_from_end”, “1-2MB_from_end”, and risk category columns for each embryo in embryo_ids, with category values in the range: “high_risk”, “low_risk”, “uninformative”, “NoCall”, “miscall”, “ADO”
embryo_ids (list) – A list of embryo columns in the dataframe to be summarised.

snp_haplotyper.snp_haplotype.summarised_snps_by_region(df, mode_of_inheritance): Summarises the SNPs per genome region for a given mode of inheritance.

snp_haplotyper.snp_plot module

snp_haplotyper.snp_plot.filter_snps_by_catergory(df, lookup_category, embryo)

snp_haplotyper.snp_plot.filter_snps_by_partner_sex(df, partner_sex='all')

snp_haplotyper.snp_plot.filter_snps_by_region(df, required_region)

snp_haplotyper.snp_plot.plot_results(df, summary_df, embryo_ids, embryo_sex, gene_start, gene_end, mode_of_inheritance)

Plots SNP data

For AD and XL produces a single plotly plot of the gene + 2MB flanking region with SNP information and summaries. For AR a faceted plot is produced, further splitting the info by the partner the SNP is inherited from.

Parameters

df –
embryo_ids –
mode_of_inheritance –

Returns:

snp_haplotyper.snp_plot.summarise_snps_per_embryo(df, embryo_ids, mode_of_inheritance)

Plots SNP data

For AD and XL produces a single plotly plot of the gene + 2MB flanking region with SNP information and summaries. For AR a faceted plot is produced, further splitting the info by the partner the SNP is inherited from.

Parameters

df –
embryo_ids –
mode_of_inheritance –

Returns:

snp_haplotyper.x_linked_logic module

snp_haplotyper.x_linked_logic.x_linked_analysis(df, carrier_female_partner, unaffected_male_partner, reference)

Identifies any SNP site which could be used to inform a decision regarding inheriting an X-linked condition (“informative” SNPs) and categorizes the site as indicating “high_risk” or “low_risk” of inheriting the condition for a known haplotype in the embryo (see below for details).

TODO: change the Google Doc link describing the full logic behind the function to a Read the Docs link.

This function asks the question: for the combination of haplotypes present in the reference trio (reference, unaffected_male_partner and carrier_female_partner), what information would this SNP provide if the SNP was:

AB in a female embryo
AA in a male embryo
BB in a male embryo

i.e. does this SNP indicate “high_risk”, “low_risk”, or is it “uninformative” in regard to inheriting the X-linked condition?

Parameters

df (dataframe) – A dataframe containing SNP input data for both partners, reference, & embryos
carrier_female_partner (string) – Column name in dataframe referring to carrier_female_partner’s data
unaffected_male_partner (string) – Column name in dataframe referring to unaffected_male_partner’s data
reference (string) – Column name in dataframe referring to reference’s data (always a child)

Returns

Dataframe containing 3 new “snp_risk_category” columns, used to categorise the SNPs as “high_risk”, “low_risk”, and “uninformative” for the three different embryo categories: female_AB_snp_risk_category, male_AA_snp_risk_category, male_BB_snp_risk_category

Return type

Dataframe
