#144 A new format experiment: Tales of DevJourney
Transcript
⚠ The following transcript was automatically generated.
Tim Bourguignon 0:12
Hello, everyone, this is Tim speaking, and welcome to Developer's Journey. There's a new format I've wanted to try for a while, something I could call Tales of Developer's Journey. You know, when you hear somebody's story, it's one thing to hear it from the beginning all the way to the end, or to the current time. But in there, you always find some interesting stories, and those stories are sometimes well worth telling on their own. So it's interesting to get the whole story and to really know what people went through. But it's also entertaining to get the stories on their own, extracted from their context. So today, I'm going to tell one of my stories, and I hope you will tell me afterwards how you liked it, and whether I should try to get some guests, maybe former guests from the show, or maybe new guests for the show, to tell me some of those stories in this campfire format.
Tim Bourguignon 1:28
My story takes me back a few years, something like 12 or 13 years, maybe. I was working for a big German healthcare company back then. We were working on big machines, linear accelerators to cure cancer. And, by the way, that's one of the most thrilling jobs I've had. You cannot fake going into a hospital to see how your machine is being used, seeing the patients going through, and seeing how some changes you make help technicians help their patients. This was absolutely mind-blowing. I was responsible for the DICOM communication of the machine, so getting the images out of the machine and onto a third-party system. And interfacing with this third-party system was a pain in the butt. First of all, the company was really not willing to let us interface with them. From a business standpoint, they wanted it, but we were competitors, and so they didn't want to make it easy for us. So I had a few workshops in Germany and in the Bay Area with them, really trying to make the best out of this political mess. And after a while, I got some reports from our testers, who were doing real-life testing, that something wasn't right. In our lab, where I was working from, we had some machines running tests and tests and tests the whole day, simulating firing radiation at patients, going through the motions and pretending to treat patients. And the reports I got were that the network communication between our machine and this third-party system was failing, and that somehow the pictures were not going through. This was a very weird bug because it was really sporadic. It didn't happen all the time. And so I was asked by my boss to get there and observe this in the field. So I rented a car and drove for a while in the German countryside. I remember in the car, there was a GPS, and the GPS at some point said, "Continue straight ahead for a very long time."
Tim Bourguignon 4:34
Literally, "for a very long time." It's not what I was used to. I was used to "for 13.5 kilometers." Now it was "continue straight ahead for a very long time." And when I arrived there, I first of all had to get through a lot of security, because you're playing with radiation and so it's very regulated. And I came up to the bunkers. They're called bunkers: concrete bunkers with big, big metal doors, with the machines in there, just to be sure that no radiation is coming out. And you have to wear radiation badges and your safety shoes and the lab coat and everything. And I sat at the machine and started doing testing rounds. The first testing round didn't show anything special. It lasted something like 10 minutes. And then the second round, again, nothing special. And I would go through the whole thing for one day without seeing anything special. And I would just get all the logs I could and take them back with me, and the next day start looking at the logs to see if I could see anything special. And of course I didn't. And then the next day, I got a call from the technicians saying they were running the tests again and it had happened again. And so I would drive back, straight ahead for a very long time, and find my way to the bunkers and start testing again and again and again. And I played this game for days, going in and out. Sometimes I got to reproduce it once, and then I would cherish those logs, trying to get everything out of the machine I could, and drive back and start analyzing all this. And I couldn't figure it out. I just couldn't figure it out. Most of the time, 80% of the time, 90% of the time, 95% of the time, all the pictures were going through and nothing was happening. And sometimes the pictures wouldn't go through. There was nothing on the network. The machine was saying everything was fine. And I just couldn't see why it was failing. I just saw there was a timeout somewhere.
But I couldn't figure out why. At some point, I still managed to isolate the problem and say, okay, this is not a problem coming from our side. We are really behaving the way we should, from observing the logs. Of course, I couldn't fire up a development machine there. I couldn't really debug all this; it would have been way overkill. I could elevate the log levels and see more logs. But of course, we were worried that we would influence the machine by doing so. And that's probably what hid the problem for a while: by doing so, our machine was slower, and thus the problem wasn't occurring. The pictures were going through. And a few months down the line, I still hadn't figured the thing out. I went on vacation. And when I came back, our project manager, his name was Andy, he was the only native English speaker on the project, and I could see him roll his eyes when we were talking, thinking, "Oh God, this is such awful English." But he bore with us. And so I came back from vacation, and Andy told me, "Hey, I found it." I said, "What?" "I found the bug." And I couldn't believe it. And that's when I learned that there is a thing called CPU affinity in Windows. And whoever installed this third-party software had locked the CPU affinity of this software onto one CPU. So basically, it was an eight-core machine with I don't know how many gazillions of RAM, but still, the software was using one core only. And in this
Tim Bourguignon 9:15
situation, when we were throwing a lot of pictures across the network, the third-party software was getting the pictures on the other side and doing some magic, indexing them, reverse engineering, I don't know what they were doing. But they were doing something CPU-intensive when they got some pictures. And if you hit that software at the wrong moment, then you would start getting some timeouts, because it couldn't handle getting pictures and processing them fast enough, and at some point it would throw a timeout. And to this day, I'm still shaking my head at not having found this thing, at not having fired up process monitoring on that third-party machine. I was focused on our machine, I was focused on our testing steps, and I never thought to test the other side. But that's the way it is. You have to own your failures, and that was mine. But I learned, I really learned a lot. So I guess it's good. And that's my story for today. Let me know if you liked it.
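The failure mode in this story, a receiver pinned to a single core that falls behind during a burst of images, can be illustrated with a small simulation. This is only a sketch: the function name `count_timeouts`, the burst size, the per-image processing cost and the timeout are all made-up assumptions, not values from the actual system. It models each core as a worker that handles one image at a time and counts how many images finish later than the sender is willing to wait.

```python
import heapq


def count_timeouts(arrivals_ms, service_ms, cores, timeout_ms):
    """Count images whose processing finishes more than timeout_ms after arrival.

    Hypothetical model: every image costs the same service_ms of CPU time,
    and each core processes one image at a time, in arrival order.
    """
    # Min-heap holding the time at which each core becomes free again.
    free_at = [0] * cores
    heapq.heapify(free_at)

    timeouts = 0
    for arrival in sorted(arrivals_ms):
        # Take the core that frees up earliest.
        core_free = heapq.heappop(free_at)
        start = max(arrival, core_free)
        finish = start + service_ms
        heapq.heappush(free_at, finish)
        if finish - arrival > timeout_ms:
            timeouts += 1
    return timeouts


# Assumed numbers: 20 images sent 100 ms apart, 500 ms of CPU work each,
# and a sender that gives up after 2000 ms.
burst = [i * 100 for i in range(20)]

pinned = count_timeouts(burst, service_ms=500, cores=1, timeout_ms=2000)
unpinned = count_timeouts(burst, service_ms=500, cores=8, timeout_ms=2000)
print("timeouts with 1 core:", pinned)
print("timeouts with 8 cores:", unpinned)
```

With these assumed numbers, the single-core run times out on 16 of the 20 images while the eight-core run times out on none. A short burst or a slower sender would produce no timeouts at all on one core, which matches how raising the log level on the sending machine could hide the problem.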
Tim Bourguignon 10:42
Let me know if I should invite some guests to get some stories like this, 10 minutes out of their context, just to remember what it is to be a software developer in the world and to have to face real problems. Well, you know where to find me. I'm on Twitter @timothep, or find me on devjourney.info. Thank you for your support. Have a good time. Bye.