There was a relatively technical question in the chat from... am I pronouncing that correctly? And there are questions from Mitchell as well. Ah, yeah. Okay: how much do you think the filtering steps bias the expression levels for each of these genes?
I think, as a whole, if you're doing general transcriptomics with the human genome, it's probably not going to drastically bias your overall results. But at the individual level, I definitely think it will affect transcript-level quantification. With short-read data your expression quantification can be quite accurate, but that depends on actually knowing the full set of transcripts that are expressed.
And then Mitchell mentioned, not as a question, just as a comment, that there are some gotchas in looking at positional changes, because the gene families you're mentioning have identical coding copies.
Okay. Do you mean identical coding sequences, or that the nucleotide sequence itself is identical within the coding part of the gene? Is that true for all of them or just most of them?
For all of those, I have seen examples in at least one individual. I think most individuals have truly identical copies for those families. There are probably many more examples than that; that's just what I can remember off the top of my head. I think there are probably about 15 families that have this issue very consistently.
Okay, I'll send you an email, because I think that's also good quality control. Again, if my pipeline is saying one thing but it's not lining up with what people know, that's important to consider.
I mean, there are reads that have very low mapping quality to a given transcript, so they could have gone here or they could have gone there, and there are quite a few of those. We all know there are lots of examples of that, and they are not very interesting.
But I think the interesting cases are these: you map a read to the reference, and it says, I go here with some confidence. You map it to your personalized reference, and it says, no, I go over here with some confidence. I think that looking at and diagnosing those cases, where personalized variation actually does influence the mapping, is informative, right? And at the margin, it has to affect expression estimates to some degree. But I totally agree with Mitchell. There are going to be cases where, unless you have some tag that tells you specifically which copy a read came from, there's nothing you can do about it.
Yeah, I think that's similar to, or probably overlapping with, the pattern where there are genes that are duplicated in every, well, in all the haplotypes of HPRC except for the reference. If those duplicated genes have at least some nucleotide differences, maybe that would be picked up.
Yeah, I always mention GPRIN2 as a positive control for that experiment.
Okay, a positive control for... It's that there's an additional copy in everybody else except for the reference. And are there coding differences that you can pick up? Okay. Yeah, that's a mis-annotation or mis-assembly in GRCh38, I think.
Oh, I didn't appreciate that it was actually an error. I just thought it was rare. That's cool.
I'm not totally positive about that. If you look at short-read copy number estimates, every once in a while you can find somebody who looks like they're single copy, but there's enough error in the copy-number estimation pipeline that I don't know whether to believe it's fixed or not.
Fair enough.
Yeah. Well, thank you so much, everybody. For people reaching out regarding the weird genes, I'll definitely send those out, and it would be great to get your feedback on whether I'm making an error on my end. For a subset, I've definitely looked in IGV and at the Minigraph-Cactus variation, and it seems that it's present, but it would be nice to get some feedback.
All right, noting the time, do you want to go?
Here we go. Thank you, everyone, for joining. As Benedict said, we're going to split this over two calls. This is the pan-epigenome panel that we've put together. I'm not going to go over the whole thing at once, since I need to give some context as to what the different pieces are, so we'll go through each part.
I wanted to first give an overview of all the analyses that we've been doing, particularly on the methylation side. There is a graph-based analysis of the growth of CpGs in the pan-epigenome, which I'll go over. We've also had to do quite a bit of data processing, and a great deal of QC work has been done, so I'm going to talk a little bit about that, primarily focusing on the PacBio data. However, there's no way I could fit all of it into this call. We've had a very nitty-gritty presentation about the methylation QC with the UCSC team, which has been recorded, and I can ask Malin to make that available. We're also meeting again today at 2 p.m. Central, if anyone wants to see the nitty-gritty of the methylation QC and how we've gotten to the point that we have.
I want to note that we are doing phasing. The result of much of our methylation analysis is methylation annotation on individual assembly coordinates, which we've then woven into the graph structure. So the pangenome has methylation annotations, essentially. We've done other modeling work within that.
And then we've done a proper analysis of methylation as a function of ancestry, just looking for global trends.
The first thing we wanted to look at is how the methylation-receiving sequence space is expanding within the pangenome. I want to note that this work is mainly driven by Christian within our little team. The first figure I'm showing here is a Panacus plot. It shows the cumulative number of CpGs, in millions, being added as the pangenome is constructed, so as you add assemblies you add CGs, and we stratify by allele frequency into fixed, common, and rare CpGs.
We then want to see where these added CpGs are with respect to the reference: where they occur and in what sort of features. Over here, this UpSet plot shows the number of non-reference CpGs in TEs, satellites, islands (meaning CpG islands), promoters, and seg dups. It's worth noting that the CpG islands are not masked.
To give context as to how the growth of the CG dinucleotide compares to all other dinucleotides, we also looked at the percent growth of each dinucleotide class relative to the reference; in this case CHM13 is the reference. We see two things. The CG dinucleotide is, of course, very depleted in CHM13. This is well known and expected. But maybe less expected is that the percent growth for the CG dinucleotide is much higher and does not follow the nice linear relationship between the abundance of a dinucleotide in the reference and how much it expands as you add additional individuals. As you add individuals, you're adding many more CGs than you would expect relative to other dinucleotide classes. And so the reason... Oh, it's not showing it.
The reason that we expect this, and hopefully I can just talk through it, is basically that methylated cytosines tend to deaminate, which means that the cytosine turns into a thymine. So if a cytosine is methylated, it's more likely to be lost and become a thymine. What we think is happening here is that you have some ancestral genome within which a new block of CpGs was added through some mutational event. Then, over evolutionary time, the ones that are methylated tend to be deaminated outside of CpG islands, whereas CpG islands are protected. And because this process is stochastic, as we sample individuals across the genetic space of humanity, we pick up ancestral CpGs that were not removed through this stochastic deamination process. You end up with a lot more CpGs because of that: it's not guaranteed that a CG that was lost in one population is going to be lost in every other population, even with that deamination mechanism. So what we think we can say from this is that we're seeing expansion of CG space in TEs, satellites, and seg dups, which are adding new CGs, and that's being balanced by this stochastic removal of cytosines by deamination. So you have this balance occurring over evolutionary time.
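The sampling argument above can be illustrated with a small toy simulation. This is only a sketch under strong assumptions that are not from the talk: every lineage loses each ancestral CpG site independently with the same probability (`loss_prob`), and lineages share no ancestry. The function names and parameters are hypothetical. It shows that the union of surviving sites keeps growing as more lineages are sampled, which is the qualitative behavior described.

```python
import random


def surviving_sites(n_sites: int, loss_prob: float, rng: random.Random) -> set:
    """Ancestral CpG sites one lineage still carries after independent random loss.

    Toy assumption: each site is lost (deaminated) with probability `loss_prob`.
    """
    return {i for i in range(n_sites) if rng.random() >= loss_prob}


def cumulative_growth(n_lineages: int, n_sites: int, loss_prob: float, seed: int = 0) -> list:
    """Size of the union of surviving sites as lineages are added one by one."""
    rng = random.Random(seed)
    seen = set()
    growth = []
    for _ in range(n_lineages):
        # Each new lineage may retain sites that every earlier lineage lost.
        seen |= surviving_sites(n_sites, loss_prob, rng)
        growth.append(len(seen))
    return growth


# Hypothetical toy parameters: 1000 ancestral sites, 50% loss per lineage.
growth = cumulative_growth(n_lineages=10, n_sites=1000, loss_prob=0.5, seed=1)
```

In this toy model the curve can only go up and saturates at the number of ancestral sites; real data would also include newly created CGs and shared ancestry between individuals, neither of which is modeled here.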
So, given all of that, we need to actually look at the methylation. So far, all I've shown you is how the CG space varies in the pangenome. Oh, there we go. But there are some challenges to analyzing DNA methylation with the data that we have. These are different technologies: we have PacBio and ONT, and we have to somehow square those two with each other. We have different platforms, Sequel II versus Revio. These data were generated at different sites at different points in time. And there are definitely some cell line differences that you can't easily control for. We're going to see some differences in passage, we're going to see some differences in the effect of EBV transformation, and environmental differences are going to add some noise to all of this. And these data were processed in slightly different ways.
So we needed to eliminate as many of those variables as we can. Unfortunately, the first four things we can't really do anything about. We can characterize what we think is there, but we cannot remove them without making statistical assumptions that we don't think we want to make, since we're providing this as a resource. So we'll characterize what they look like and what effects may be there, but ultimately let the end user decide whether they would like to, for instance, residualize things out. What we could do, and what we did, was to process everything in as consistent a way as possible. Instead of some PacBio samples being processed with Primrose and others with Jasmine, we processed everything with the same versions of everything, as far as we're able to. For the methylation QC analysis, I'll mainly be focusing on PacBio, but I'll talk about the ONT a bit as well.
The first high-level QC we wanted to do is just to look at how samples cluster based on DNA methylation, agnostic of phase. This takes the data once we've reprocessed everything for PacBio, aligned it against CHM13, and called methylation with pb-CpG-tools, and then we binned the genome in three ways. We looked at bins that are only CpG islands, and these are masked CpG islands. We also looked at the spaces in between CpG islands, so CpG seas, if you will. And we also took a uniform gridding across the genome. So it's just three different ways to slice up the genome to see how samples cluster with respect to each other.
Here, each point is an individual. On this plot I'm color-coding them by sex, and this is the principal components analysis. The first column is when we bin the genome into 50 kb bins, the middle column focuses on the CpG islands, and the right column focuses on the CpG seas, the space between CpG islands. The first row is PC1 versus PC2, and the bottom row is PC2 versus PC3. The vast majority of the variance is captured by PC1 and PC2, and what we don't really see is a sex difference. By eye there's not a clear difference; maybe the males have a little more overall variability, but there's no clear segregation by sex.
Likewise, we can color-code the same data by population label. Here we can't use PCLI, obviously, because it's genome-wide, so I'm just using these population labels. What you should see is that, for the most part, across all three contexts, we do not see separation by population label. The exception is that, a little bit in the 50 kb bins but especially in the CpG islands context, a subset of the Japanese samples, which are all male, seem to stand out. They are outliers. But we don't see that outside of CpG islands.
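The uniform gridding step described above can be sketched roughly as follows. This is a minimal illustration, not the team's pipeline: the input format (tuples of chromosome, 0-based position, and methylation fraction) and the function name `bin_methylation` are assumptions made for the example, and the island and inter-island binnings would need interval annotations that are not modeled here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_SIZE = 50_000  # 50 kb uniform bins, matching the uniform gridding in the talk


def bin_methylation(
    calls: Iterable[Tuple[str, int, float]], bin_size: int = BIN_SIZE
) -> Dict[Tuple[str, int], float]:
    """Average per-CpG methylation fractions within fixed-width genomic bins.

    `calls` holds (chrom, 0-based position, methylation fraction in [0, 1]);
    this input layout is an assumption for illustration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chrom, pos, frac in calls:
        key = (chrom, pos // bin_size)  # bin index along the chromosome
        sums[key] += frac
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical toy calls for one sample: two CpGs in the first bin, one in the second.
toy_calls = [("chr1", 100, 0.75), ("chr1", 49_999, 0.25), ("chr1", 50_000, 0.5)]
binned = bin_methylation(toy_calls)
```

Stacking the per-bin means of every sample into a samples-by-bins matrix would give the kind of input a principal components analysis expects.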
So by looking at methylation genome-wide in these three different ways of slicing it up, we can conclude that methylation may be shifting globally. I have other analyses that I don't have time to show you today, but methylation may shift up and down across the genome for a given individual, while within CpG islands the variation is much tighter, because these are spaces that tend to be regulated. We don't see strong sex differences. Males do seem to have somewhat higher variability, but I think that's largely to do with that subset of Japanese samples. And there are no real population differences that we can see globally. Now, the subset of Japanese samples that are outliers stand out most strongly in the CpG islands. However, funnily enough, if we narrow this further and look at promoters, that effect goes away. Benedict?
Just curious, the Japanese samples, were they the ones that were sequenced by the Japanese group? There were 10 samples that were not sequenced by the HPRC, essentially. I'm just wondering if there's a technical artifact going on.
So we're trying to sort that out. Part of it, I think, is definitely technical, because they are all Sequel II based. They're samples that have essentially no Revio data, whereas almost all other samples have a mixture of Sequel II and Revio, and that's going to have an effect. But there's also the challenge that the cells they used are physically separate. It wasn't the same pool of cells. So it's hard to be certain that it's purely technical.
We've gone back, with their help, and reprocessed those samples as best we're able to, and we're going to look again and see whether redoing the base modification calling with the kinetics tags intact makes them more like the other samples. But if not, you know, either it is something technical, like purely technical... I should note that in the ONT data they are not outliers. Yes, Karen?
You might have just answered my question, but I was going to say that we did duplicate the ONT data. So we had two sets: one was generated by HPRC and one was generated by the Japanese team. I don't know how separate they are, but that could be a good benchmark if you suspect there's a clonal thing going on on their end, but it sounds like that's not the case.
Well, actually, I have a question about that. The ONT data that you generated, was that from the same pool of cells that they used to generate the PacBio? Because my suspicion is that what we're seeing is not actually technical in the analysis sense. I think the cells were just slightly different.
The point is that there should be two separate ONT experiments. They have double the ONT coverage of every other cell line. One of their ONT experiments came from that batch of cells, and the other came from a separate HPRC batch of cells. So in that case, I just don't know if they're the same. But they're two independent experiments, from two different cell passagings.
Yeah. Okay. I will follow up with you to get those details so we can make the appropriate comparisons. That's fantastic.
Just a reminder, though: we would be pleased if this were a technical artifact, right? It would be slightly weird to see the 10 Japanese samples truly acting differently. I can't see any argument for why that would be the case.
Well, if something happened in the environment the cells were in, right?
So methylation is really responsive to the environment.
I would consider that to be a technical artifact.
Oh, you were saying in the individuals, or at some point during the passaging of the cells?
At some point in the passaging of the cells.
So, yes, if that's how you define technical, yes. I was thinking of technical as, did my code do something wrong?
Yeah, I appreciate you noting that. Perfect. Thank you.
Okay. So, same data, but if we plot it by Revio percentage, you see again that it's specifically the subset of the Japanese samples that are entirely Sequel II.
Okay, so now, what we wanted to do, of course, is take these methylation calls on individual assembly coordinates and weave them into the pangenome graph. We put together a phasing algorithm for this, because we of course need to phase the reads, and given the high quality of these assemblies, we want to take advantage of them to do the phasing. This is our overall approach; again, if you want to get into the weeds with me on the QC, come to the talk at two. Basically, we take the whole set of donor reads and align them to the maternal, or HAP1, assembly, and we align the same set of reads against HAP2. Then we load the HAP1 alignments into memory and stream through the HAP2 alignments. Here we discard any reads that have supplementary alignments, and obviously any reads that are unmapped. For each read, we compare the alignment to HAP1 with the alignment to HAP2, looking at the MAPQ score. If the MAPQ score for HAP2 is greater than for HAP1, we assign the read to HAP2, right?
If it is not greater than HAP1 and they are, say, equal, then it falls back to the mg tag, which is the match rate: matches over matches, mismatches, deletions, and insertions. If those are not equal and HAP1 has the better mg tag, we assign the read to HAP1. If it's the other way around, we assign it to HAP2, and so on. And then we can run into a tie, right? For ties, we initialize the run with a flipper, which starts at false. Every time a tie comes up, it assigns that read to one of the haplotypes and then flips the flipper, so that the next time a tie comes up the read is assigned to the other haplotype. That way we assign evenly to the two haplotypes when the MAPQ and the mg tag cannot distinguish them.
Given those phased alignments, now against the individual assemblies, we looked at alignment metrics: alignment rates, mismatch rates, things like that. We do see some phase asymmetry in the alignments, where HAP2 tends to have slightly lower alignment quality than HAP1, but it's quite small.
I don't know if you can see it, but the coverage is just much more reliable. So that just kind of makes it a little bit easier to have coverage over some thresholds. And it does look like HP 30097 might be a likely outlier in terms of alignment quality. So she's doing a little bit of this one already. And then we did a bunch of different tests to see whether or not phasing looks good. And we see that there is a degree of symmetry in the phasing, where we've sort of narrowed it down to be potentially a sort of a possibility of agents who are in those regions. So in some places, with things that you can really kind of have parts of time, we're not being asymmetrical in terms of the being decided, but the observed errors of the perspective, the alignment, is slightly different.
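The read-assignment procedure described earlier (MAPQ first, then the mg tag, then an alternating flipper for ties) could be sketched roughly like this. It is an illustration under stated assumptions, not the team's implementation: `Aln` is a hypothetical minimal record rather than a real library type, supplementary and unmapped reads are assumed to have been filtered out upstream, and sending the first tie to HAP1 is an arbitrary choice, since the talk only says the tie goes to one of the haplotypes.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Aln:
    """Hypothetical minimal alignment record (not a real library type)."""
    read: str
    mapq: int
    mg: float  # value of the mg tag, treated here as "higher is better"


def assign_haplotypes(
    hap1: Dict[str, Aln], hap2: Iterable[Aln]
) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, 'hap1' or 'hap2') for reads aligned to both assemblies."""
    flipper = False  # starts at False and is toggled after every tie
    for a2 in hap2:  # stream through the HAP2 alignments
        a1 = hap1.get(a2.read)  # HAP1 alignments are held in memory
        if a1 is None:
            continue  # read was filtered out or unmapped on HAP1
        if a2.mapq != a1.mapq:
            hap = "hap2" if a2.mapq > a1.mapq else "hap1"
        elif a2.mg != a1.mg:
            hap = "hap2" if a2.mg > a1.mg else "hap1"
        else:
            # Full tie: alternate haplotypes (first tie to hap1 is an assumption).
            hap = "hap2" if flipper else "hap1"
            flipper = not flipper
        yield a2.read, hap


# Hypothetical toy alignments covering each branch of the decision.
hap1_alns = {
    a.read: a
    for a in [Aln("r1", 60, 0.99), Aln("r2", 60, 0.98), Aln("r3", 60, 0.99), Aln("r4", 60, 0.99)]
}
hap2_alns = [Aln("r1", 20, 0.99), Aln("r2", 60, 0.995), Aln("r3", 60, 0.99), Aln("r4", 60, 0.99)]
calls = dict(assign_haplotypes(hap1_alns, hap2_alns))
```

Here `r1` is decided by MAPQ, `r2` by the mg tag, and `r3` and `r4` are full ties that the flipper splits between the two haplotypes.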
or possibly in terms of what content is scientific on non-sex if one of those is more likely to get more rich or perhaps more cognitive And one way to sort of look at this is to look at it in a different way. It's a chance of your imagination being able to do one viable. And if the black bar is here, the way you're presenting the results of it, it ought to be this track. That's a really good idea. This is what the methylation is. So the hydrogen is hydrogen methylation. If you look at the first body position the other half of the body is individual is people with psychometallic psychometallic and we see the metal lid in my hand I have a metal lid in my hand And you can see and we see the same thing I sorry I don see it I see 5 We see 3-0-1. We see Q-1. We see Q-2. And I'm not sure what the option is. But I'm sure that's the only way. And I mean, we all see it.

