00:00
This material is made available to you by or on behalf of the University of Melbourne under section 113P of the Copyright Act 1968. It may be subject to copyright. For more information, visit the University copyright website. Okay, good afternoon, let's make a start. Hopefully you can hear me okay down the back? All right, thanks all. I might turn it down a bit, see how we go. Announcements: the project, as you would have seen, was released on Wednesday. I'll say a little bit more about that at the end of the lecture, see how we go for time. Here's the plan for today.
01:10
I want to just wrap up where we were on recommender systems, then start looking at visualization. We're going to look at a few things in visualization, we'll talk a bit about the project, and maybe look at something topical in AI as well. So let's switch to the previous slide pack from last lecture. We were talking about recommender systems, and we had this scenario where we've got a data set and we need to fill in one of the blank entries. We don't know what it is, and we'd like to make an inference because
02:00
perhaps we want to make a recommendation to somebody based on what we think that rating would be, in the case of a movie. So we looked at a couple of ways of doing it. One way is to choose the people who are similar to you and fill in the rating based on them. Or we do it another way: we choose the items that are similar. The example I used was the movie Aquaman: we choose other movies that are similar to it and fill in the missing rating based on those. A natural question is, what if we do both at once? Looking at this table, you can look at either the rows or the columns to fill it in. What if you try to look at both of them together? That's the intuition. It gets quite a bit more complicated, in fact. I'm just going to spend five minutes giving you a flavor of it. That complicated material is not examinable, it's just for interest. It's a real technique that's quite important in the area, so I think it's interesting to see the flavor of it. So the idea is we're going to treat this thing as a mathematical object. Some of you will be familiar with the mathematics, some of you won't.
03:36
Basically, what we're going to do with that mathematical object is break it into two pieces. An analogy in numbers: we can take a number and decompose it into its factors, two things that combined together give us what we want. So ninety-nine is three lots of thirty-three.
04:00
We can also do it in an approximate way, some sort of approximate breakdown of the components of our number. So one hundred and sixty-seven is approximately seventeen times nine point eight, and if you multiply them together it gives almost what you want: a hundred and sixty-six point six instead of a hundred and sixty-seven. This is pretty intuitive if you're just working with numbers. The next bit you have to take on trust to a certain degree. We've got a table, in this case a table of ratings, which is R. And it turns out, rather nicely, that you can break this table down into two smaller tables so that when you combine them together you get almost what you started with. The details of how you do it are based on these mathematical matrix objects.
05:09
So this is the sort of thing that you do. You start off on the top left-hand side with some table, and you break it down into two smaller things such that when you multiply them together (and there's a whole algebra for doing that) you get something that is almost the same as what we started with. So you look at five, two, three, six. I break it down into these two pieces, and when I combine them back together I get almost the same stuff: I get a three point two two seven instead of a five, or two point zero five instead of a two, three point six instead of a three. All I'm doing is matching up the entries here with the entries over there.
06:00
You can also calculate some measure of quality there. You can ask yourself, how good was my breakdown? The way you can do that is by computing this error term: you take each cell, compute the difference between what your factorization gives and what you really saw, and you come out with some error. The less the error, the better you've done. All right, why is all of this useful for us? Well, it turns out you can do this even if there are missing pieces in the rating matrix. So even if you've got a rating matrix, top left-hand side, with stuff you don't know, it turns out you can actually break it into factors in exactly the same way. And when you multiply them together you'll get stuff that is almost the same as the non-missing values. So four point nine five is almost the same as five; eight point six five is
07:04
something we've guessed; six point oh two is almost the same as six. So you can see what we've done is fill in all the missing entries: essentially, by breaking it down into two pieces and recombining them, we can come up with a guess for all of the bits that we don't know. This whole exercise has got a name: it's called matrix factorization. It's also called latent factor analysis. If you do psychology you'll use this a lot: they do this with personality analysis, breaking down people's behavior into latent factors. But you can do it in recommender systems too. There are a few choices you need to make about how big or small the little pieces are, the actual dimensions of them.
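To make the idea concrete, here is a minimal sketch of filling in missing ratings by factorization. It is an illustration under assumptions, not the method from the slides: the small ratings table, the function names (`factorize`, `multiply`, `squared_error`), the number of latent factors `k` (the "how big are the pieces" choice), and the plain gradient-descent updates are all choices made for this example.

```python
import random

def factorize(ratings, k=2, epochs=5000, lr=0.01, reg=0.02, seed=0):
    """Find U (rows x k) and V (k x cols) so that U times V approximates
    the known entries of `ratings`; None marks a missing rating."""
    rng = random.Random(seed)
    n_rows, n_cols = len(ratings), len(ratings[0])
    U = [[rng.random() for _ in range(k)] for _ in range(n_rows)]
    V = [[rng.random() for _ in range(n_cols)] for _ in range(k)]
    # Only the known cells drive the fit; missing cells are ignored.
    known = [(i, j) for i in range(n_rows) for j in range(n_cols)
             if ratings[i][j] is not None]
    for _ in range(epochs):
        for i, j in known:
            predicted = sum(U[i][f] * V[f][j] for f in range(k))
            error = ratings[i][j] - predicted
            for f in range(k):
                u, v = U[i][f], V[f][j]
                # Nudge both factors to shrink the error (small penalty `reg`).
                U[i][f] += lr * (error * v - reg * u)
                V[f][j] += lr * (error * u - reg * v)
    return U, V

def multiply(U, V):
    """Recombine the two small tables into one full table."""
    return [[sum(U[i][f] * V[f][j] for f in range(len(V)))
             for j in range(len(V[0]))] for i in range(len(U))]

def squared_error(ratings, approx):
    """Error term: sum of squared differences over the known cells only."""
    return sum((ratings[i][j] - approx[i][j]) ** 2
               for i in range(len(ratings)) for j in range(len(ratings[0]))
               if ratings[i][j] is not None)

# Hypothetical 3 x 3 ratings table with three unknown cells.
ratings = [[5, 3, None],
           [4, None, 1],
           [None, 1, 1]]
U, V = factorize(ratings)
filled = multiply(U, V)          # every cell now has a value, including the gaps
error = squared_error(ratings, filled)
```

The cells of `filled` that line up with `None` in `ratings` are the guesses; a larger `k` gives a closer fit to the known cells but is not automatically a better guess for the unknown ones.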
08:03
And that's probably going to be beyond our scope, but you do have to make some choices. Commercial systems have often used a technique a bit closer to this, because when you go through it you're actually considering the rows and the columns together, rather than just considering users by themselves or items by themselves. There was a very famous competition a number of years ago: Netflix offered a million dollars. They gave out a big rating table and asked people to fill it in, and the people who got closest won a million dollars. It was called the Netflix Prize. A hundred million ratings were in the data that they released. There's quite an interesting history behind it; you can read about it in the paper that's attached on the
09:03
website. They then thought it was such a good idea they'd do it again a couple of years later. But it turned out that when they tried to do that, somebody was able to reverse engineer the data and find out who was actually rating what movies in the data they had released. So there was a privacy leakage and they cancelled it. We'll talk more about that a bit later. But the point of the exercise is just to illustrate that there's another way of doing all of this that's a bit more complicated but turns out, in practice, to be very accurate. Happy to take any queries or questions on that. Yes, question? Yes. Look, I can tell you all of the big tech companies will be doing this in different ways for making recommendations, absolutely.
10:08
Yeah, now we don't know exactly what they do, but this is the type of thing that they do. And the mathematics underneath is all linear algebra, if you ever do a course on that. Okay, you can read a bit more about it in one of the references that's probably listed at the end. Yep, that's the reference. Quite an interesting reference; you can skip over the mathematics if you want to read it and it can still be quite useful. All right. Where do we go next? We've looked at missing values, we've looked at data cleaning, and we're going to look a bit more at visualization today. So far we've looked at XML, JSON, cleaning, outliers, filling in missing stuff.
11:04
In your project you'll see there are a lot of visualizations. Visualizing data is a good, useful thing, so let's have a bit of a look at visualization. Visualization is good; I don't need to say too much there. Basically, if you can come up with a good visualization you can often see some interesting stuff. In this example down the bottom we've got a table that's just got zeros and a few non-zeros. You look at that table and it's very hard to know what's going on. You plot it in some visualization and you can see it's the digit one. This is a handwritten digit, the number one. Simple visualization. So what have we seen? We've seen box plots:
12:04
the spread of the values. We've seen calculations on those, and outliers, so plotting the data to see things that are strange. Scatter plots: visualizing in 2D or 3D. What's the limitation of a scatter plot? Why don't I use scatter plots for everything? Is there a reason? If so, what might it be? Scatter plot... yes? Yep, heaps of data: you've got lots of points and the plot becomes really ugly. I totally agree. Anything else that's related to that? Suppose I have even a small number of rows, is it still... yes, down the front? Eyeballing the data to find outliers is subjective. Yes, so you're criticizing
13:06
the scatter plots from the point of view of what you do with them. Maybe it's subjective, yes, that's true. But what if we had a really great human expert, someone we really trust, looking at the data? Is there another reason that a scatter plot visualization might not work? Yeah? So maybe we can just... make two-dimensional into one-dimensional. Okay, so I think what you're thinking about is that the number of dimensions is somehow a key ingredient. Plotting in 2D or 3D is great if your data has two or three features, but what if it has a hundred? It becomes a pain. You're going to have to choose pairs or triples and plot them, and it's pretty hard to make that choice.
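As a quick check on why choosing pairs or triples becomes a pain, the short sketch below counts them. The feature names `f1` to `f4` and the helper `feature_pairs` are made up for this illustration; the counts themselves are just standard combinations.

```python
from itertools import combinations
from math import comb

def feature_pairs(features):
    """List every unordered pair of features, one scatter plot per pair."""
    return list(combinations(features, 2))

# Four features give a manageable number of 2D scatter plots.
four = feature_pairs(["f1", "f2", "f3", "f4"])

# A hundred features do not.
n_pairs_100 = comb(100, 2)     # number of possible 2D plots
n_triples_100 = comb(100, 3)   # number of possible 3D plots
```

With four features there are only six pairs to look at, whereas a hundred features give thousands of pairs and far more triples, which is the headache described above.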
14:00
So that, I guess, motivates other techniques. I'm just going to talk about flowers for a minute, because that's the data set that's in these examples. It's a very famous flower data set; you can download it from Wikipedia. There are three types of flowers, and if you're a flower enthusiast you could measure stuff about flowers. In particular you could measure four different features: the width and length of something called the sepal, and similarly for the petal. And you get a data set that is something like this: each row is a flower, you've got four different features, or attributes, and there's a species. In our data set we've got three different species of flower. Okay, so we've got this data set about flowers. That's what it looks like, that's what a flower looks like, and the bits that are being measured are these things in here.
15:07
What can you do? One thing you could do is plot a histogram. The x-axis here, the petal width, is just one of those columns. We've plotted it: we've binned it, put it into buckets, and we count the number of flowers in each bucket. And what do we see? Petal width: most of the flowers have a very small petal width, is one conclusion we could make. And hardly any, or none, have a petal width close to one. So we can do that. We've looked at histograms before, but you have to choose the bin size, and it can be a headache. Scatter plots, 2D plots: we've got four different features, so we can take every pair of them and do a plot.
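The binning step behind the histogram mentioned in this passage can be sketched in a few lines. This is an illustrative version only: the `histogram` helper and the handful of petal widths are invented for the example, and equal-width buckets are an assumption.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width buckets."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first bucket.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical petal widths, not the real data set.
petal_widths = [0.2, 0.2, 0.3, 0.4, 1.3, 1.5, 1.8, 2.3, 2.5]
coarse = histogram(petal_widths, 2)   # two wide buckets
fine = histogram(petal_widths, 5)     # five narrower buckets
```

The same nine values look like two roughly equal groups with two buckets, but show an empty gap just above the small widths with five buckets, which is why the choice of bin size matters.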
16:03
And so I've got a four by four grid, six different pairs, and I can color the points according to species. We'll just color the species. You look at that; what do you come away with? Can someone suggest some useful information that we could get out of this scatter plot? Yeah? Yeah, so great observation: the blue ones look a little bit different from everyone else, so maybe they're a slightly strange sort of species. It seems to be a trend that happens across all of the different little boxes. Anything else? That's a good one.
17:00
Just a general sort of statement that might be mostly true. Yeah? Bigger. Oh okay, so you're thinking about the magnitude of the measurements, and in general the red ones are sort of pushed further up or pushed further to the right. Yeah, looks right, good observation. You might also observe the clustering: the species group together quite nicely. The reds are all together a lot of the time, the greens are all together a lot of the time; they're not randomly sprayed across. So we get some good stuff even from this. So this is one way to do it. You can always do it; as I said, it just becomes a headache if you've got lots of features. Another way to do it is basically to color everything according to
18:06
its value. That's called a heat map. So this is the sort of thing that we're talking about. Again, this is just my data: each row is a flower, so I've got one hundred and fifty flowers, and I've sorted them according to species on the y-axis. So all of these are just flowers, grouped according to species. Then each of my four features is across the bottom, and the color tells me the value, the strength, of a particular feature value. So red is very high. And what was that great observation before? The red ones, the virginica, seemed to be a bit higher in terms of magnitude.
19:04
So yeah, the virginica has a bit more yellow, so they tend to be a bit higher than the other ones. I can look at that plot and make the conclusion that the virginica, looking across left to right, is a bit more yellow than everyone else. In general it's got high measurements; they're bigger flowers, perhaps. I can do that. You might have to play around with the scaling: if the features are on different scales it will be hard to compare them in terms of color, so you might need to rescale into a range of zero to one, or do a statistical normalization. But this is just a heat map, that's all it is. Let me show you another example.
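The rescaling into a zero-to-one range mentioned above is usually done column by column. The sketch below shows one common way (min-max scaling); the helper name `min_max_scale` and the sample lengths are assumptions for illustration.

```python
def min_max_scale(column):
    """Rescale a list of numbers so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no contrast; map everything to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# The same three hypothetical lengths, recorded in meters and in centimeters.
lengths_m = [1.0, 2.0, 3.0]
lengths_cm = [100.0, 200.0, 300.0]

scaled_m = min_max_scale(lengths_m)
scaled_cm = min_max_scale(lengths_cm)
```

After scaling, the two columns are identical, so a color scale applied to them no longer depends on the unit each feature happened to be measured in.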
20:00
This one is possibly a little bit more fun. All right. Okay, so I've just taken this from a website, and what they've done is take a data set of everyone's birthday. Think about your birthday, think about everyone else's birthday sitting in the lecture. Then we count the number of people who have a birthday on January the first, or a birthday on the eighteenth of June. The darker it is, the more common your birthday is; the lighter it is, the less common your birthday is. This yellow thing is the most common birthday. So what is it? September seventeenth is Australia's most common birthday, using this data.
21:03
So if you're born around September (here we go, look how dark it is in September) you've got a very common birthday. If you're born in December, your birthday is much less common, in a sense. And this is just a heat map, and it's telling us some interesting stuff. So what do you do? You look at this and immediately it suggests all kinds of questions. You ask yourself: why is September the most common birthday?
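The counting behind a birthday heat map like this one can be sketched as follows. The list of birthdays is invented for the example, and the month-by-day grid layout is an assumption about how such a map is typically arranged.

```python
from collections import Counter

# Hypothetical birthdays as (month, day) pairs.
birthdays = [(9, 17), (9, 17), (12, 25), (9, 18), (1, 1), (9, 17)]

# How many people share each date.
counts = Counter(birthdays)

# A 12 x 31 grid (months down, days across); each cell is the value
# that a heat map would turn into a color.
grid = [[counts.get((month, day), 0) for day in range(1, 32)]
        for month in range(1, 13)]

most_common_date, how_many = counts.most_common(1)[0]
```

Cells for dates that do not exist (such as the thirty-first of a thirty-day month) simply stay at zero in this layout.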
22:09
Anyone want to suggest, based on intuition, why September seventeenth is so common? Any suggestions? Sorry? Spring, yes, spring, right. But this is the day people are born, so people are conceived before spring. Go back about nine months and you're probably in early January, December, the holidays, and that's, you know, a time where things happen. So September, perhaps, people believe, reflects social dynamics at play. There's also some other analysis on this page; let's have a look.
23:13
Here's another graph, it's quite nice. On the x-axis it's the date someone is born, and the y-axis is how many people are born on that particular date. So the higher it is, the more people are born. Then we've got this huge dip going down: twenty-fifth of December, Christmas, Boxing Day. Very few people, relatively speaking, are born then. And as it says at the bottom, maybe people are somehow avoiding having a child on that day, holding on a little bit or something, I don't know. Here is just another way of
24:14
presenting the same sort of information. Again, it's just looking month by month: you look at what month you're born in and see whether that's an above-average month for birthdays or a below-average month for birthdays. And December again is under-represented. If you're born in December it probably means you were conceived around March, so there's something about March that's maybe less, you know, less promoting of it all happening. Another interesting thing: again, this is just a date-by-date plot, so the y-axis is just how many people are born, the higher the more, the lower the fewer. You can see it sort of goes up and down,
25:14
sort of pretty crazy when you first look at it. It turns out some of these dates fall on a weekend more often. So again it's this behavior that somehow you're less likely to be born on a weekend than on a weekday. The reason that's offered here is that there are a lot of caesarean sections, a surgical procedure in order to have the birth, and maybe the doctors aren't so enthusiastic about coming into the hospital to do that on a weekend, so it gets pushed to the weekday. So, interesting stuff. The point of the example, initially at least, was that heat maps are a good way to start thinking about that.
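One way to check a weekend effect like this in your own data is to compute the share of births that fall on a Saturday or Sunday. The sketch below uses a made-up list of dates and a hypothetical helper `weekend_share`; if births were spread evenly over the week, the share would be about two sevenths.

```python
from datetime import date

def weekend_share(birth_dates):
    """Fraction of dates falling on Saturday (5) or Sunday (6)."""
    weekend = sum(1 for d in birth_dates if d.weekday() >= 5)
    return weekend / len(birth_dates)

# Hypothetical birth dates: 1 January 2000 was a Saturday,
# the other three are the following Monday to Wednesday.
births = [date(2000, 1, 1), date(2000, 1, 3), date(2000, 1, 4), date(2000, 1, 5)]
share = weekend_share(births)
```

A share well below two sevenths over a large data set would be consistent with the weekday pattern described in the lecture.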
26:13
Anyone want to give any comments or reactions on any of that? Something that doesn't look right, or something I haven't mentioned? The URL's on the slide; go and have a play around with it. See how popular your birthday is. Okay. Oops, that wasn't it. Okay. So that's the idea behind heat maps: just coloring according to intensity. Another method...
27:00
Probably you haven't seen this one before. Less used, I think, but still pretty interesting, is something called parallel coordinates. Let's see an example. We've got our data again, our flowers, and we sort of plot it horizontally, is the idea. So what's happening here? Each flower is a line, so there are a hundred and fifty different lines: each row is a line. And you can see that each feature, or attribute, is a vertical axis. So each flower is a line going left to right, and each feature is a vertical line. If a flower has a sepal length of eight, I'll plot the line starting there, then I'll find its sepal width, plot it, and join the line there. Similarly, I put the line where it is for the other attributes.
28:03
So you can take a row, you've got four attributes, and you just draw the line, putting it in the appropriate place on the axis for that particular feature. Okay, and here we've just colored according to species. So it's parallel in the sense that you've got parallel axes on the vertical there. Okay, now I've ordered these a certain way, so in my picture here I've put sepal length first, sepal width second, petal length third, petal width fourth. Why did I do that? Well, I didn't have to; I can reorder them in any way I like.
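The construction just described (one line per flower, one vertical axis per feature, in a chosen order) can be sketched without any plotting library. The measurements, the column names, and the helper `polyline` are assumptions for this illustration; in practice pandas also offers `pandas.plotting.parallel_coordinates` for drawing the actual picture.

```python
from itertools import permutations

def polyline(row, axis_order):
    """Turn one row into the points of its line: axis position on x,
    the feature's value on y, following the chosen axis order."""
    return [(position, row[name]) for position, name in enumerate(axis_order)]

# One hypothetical flower.
flower = {"sepal_length": 5.1, "sepal_width": 3.5,
          "petal_length": 1.4, "petal_width": 0.2}

first_order = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
second_order = ["petal_width", "sepal_length", "sepal_width", "petal_length"]

line_a = polyline(flower, first_order)
line_b = polyline(flower, second_order)

# Every possible left-to-right arrangement of the four axes.
n_orderings = len(list(permutations(first_order)))
```

The same flower produces a differently shaped line under each ordering, and even four features already allow two dozen orderings, which is why picking one is largely a matter of experimenting.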
29:00
Here is another way. So this was the first way; this is the second way, and it looks a bit different. Instead of sepal length being first, now I've got petal width, and it looks rather different. I can reorder again: this is a third way, and again it looks slightly different. And again, a fourth way looks different again. So every time I play around with the order I'm going to get a different picture, which might reveal slightly different information for me. How do you know which ordering is best? Well, that's a hard question. I think the simple answer is you play around. There are rules of thumb for doing it, but I think my best advice is that it's a matter of playing around. Probably you're not going to use this technique if you have too many columns, too many features; it's just going to
30:03
be a bit of a headache if you've got lots and lots of columns that you have to choose an order for. Okay, so I've pretty much said this: each object is a line. I've said the order is important. Questions, just at a high level so far, on how to interpret that or how we've generated it? Anything you want to ask? Okay. So maybe a question for you guys... not yet. Okay, before that: the visualization will be affected according to the scaling of each axis. So, you know, if I'm measuring this one in meters and this one in centimeters, it's all going to look a bit different
31:11
as opposed to both of them in centimeters. So you might want to play around with the scale, or you might want to normalize everything to fit into the range zero to one, which is a bit more common. I've also said that the order matters. All right, a question to think about. This is an exam question from last year. In this question from last year there's a visualization; I think this is just one of the ones from the slides. We've got this, and the question is asking
32:01
you to explain three inferences, three things that you could conclude by looking at this visualization, compared to just looking at the raw data in a spreadsheet or a CSV file. So, three things this visualization tells you that you might not get from the raw data directly. All right, that's the question. Why don't you spend a couple of minutes and see if you can come up with three things? Speak to the person next to you, speak to anyone around you, and we'll get some answers in a couple of minutes, right over here.
33:51
You have to be creative.
34:21
It's gone quiet, so everyone's got three? All right, okay. Well, let's take some suggestions. Somebody give me one. Somebody start me off. How about... okay, here. What did you come up with? I think, only for the setosa, the length of the petal is smaller than on average... Gosh, I have to think about that. So the length of the petal is smaller than... so this thing here is smaller than, right, this thing here is smaller.
35:06
Okay, so I think what you're saying is this black group here is somehow different, and that jumps out immediately: we can see that the setosa group, the black group, is quite different in terms of its behavior across these two features. Yep, great, totally agree. Yep? Yeah, true, so they're longer than they are wide. That could be interesting from a flower point of view. Totally agree, it's a good one. Yeah? All the species sit inside around the same range. Okay, so the suggestion you're making is that on the second feature everyone looks pretty much the same, if that's what you're measuring.
36:00
So everyone looks pretty much the same under this measurement. Yeah, that's a good one. What else? We've got three, fantastic. Another one? Positively... okay, so you're looking at the last two here and you're saying they're positively correlated. So you mean that if one goes up the other goes up as well, and if one goes down the other goes down as well. And we can pretty much see that here: as petal length goes down we can see this goes down, and the opposite for the others. Yep, great one. So we can see relationships between pairs of features. Anything else that we've missed? Any other insights? Yes, down the front.
37:08
So you're saying there's an order in petal length and petal width. You mean an order across the hundred and fifty flowers which is somehow maintained, I think. A longer than... one, and that's not the black one. Right, so I think you're saying that for these, yellow is top, blue is middle, black is lowest, and that order is maintained across the petal width feature. And that's, yeah, absolutely right, good observation. Maybe that's to do with the correlation between those two features: they're positively correlated. Anything else? What about
38:00
anomalies, or outliers? A good question to ask yourself is: can I look at this and make any judgment about outliers? It's already been suggested that the black group is an outlier, so there's a whole group that's a bit strange. You might also think about whether there's an individual flower that's a bit strange compared to all the other flowers. I don't know, maybe this flower here, this one that's going down; it seems a little bit on the edge compared to the ones in a similar species. So perhaps we could say something about which flowers are the outliers compared to the other flowers.
39:01
Okay, so that's the exam question. You're asked for three things, you get three marks. As long as the three things are plausible and don't duplicate each other too much, you're set for three marks. Any further comments or queries on that? Question, yes? Like the scale... and can we actually... a similar feature compared to them, like... the length of the... is longer than the width of... Yeah, so in this example we're sort of assuming the scale is all the same, maybe everything's measured in centimeters, so it makes sense to compare these two things for this example. But if we'd processed our data differently and done some normalization, maybe we wouldn't be able to do that.
40:11
So it'll depend on the data set, it'll depend on your measurements and whether you do something with them before you plot. Yeah, I'm saying you might need to be careful about that. Okay. There are Python pandas packages that you'll play around with that allow you to do this. You'll practice it, and I think there's a question in the project as well where you'll practice it. Okay, let's maybe say something about the project.
41:01
It's due the tenth of April, which I think is approximately three weeks elapsed time from release to due date. What is it? It's essentially practicing calculations and visualizations on data frames, data frames that are created by reading in a data set, using the pandas library to do stuff with that data set. Most of the concepts that the questions are based on we've already done, and you'll get a bit more practice on them in the next two workshops. We've talked about them and practiced a number, and you'll get a bit more practice in the next one or two weeks. We think we've been reasonably
42:01
helpful, and maybe even generous, with the amount of hints that are provided with each question, so please do use them. They should be helpful. Probably obvious, but I know there's a wide range of familiarity: some of you love Python and do it every day, others of you haven't thought about it for a year, and there's everyone in between. So if you're maybe a bit rusty, a bit less confident, we've provided some resources, links to tutorials. They're pretty good. Spend two or three or four hours just brushing up on those; they cover pandas as well as Python, so hopefully they're going to be useful if you do that.
43:03
We have a weekly consultation session that Chris has. We'll put on some extra ones and announce that on the LMS. Mainly, if you've got programming-type questions you can come and see someone at a particular time. Those are some high-level points. Questions? Anything on your mind about the project at this stage? A big sigh at the front. Yeah, I know, you have to do it, that's how it is. Okay. All right, maybe what I will do, since we have time: I was sent this link today by a colleague
44:06
from Google. The gist of it is, you know, we know Google does a lot of interesting stuff. It gets called all kinds of names: it might be data wrangling, it might be data science, it might be called AI. A lot of it is called AI, and underneath it's doing stuff with data. They have a sort of weekly thing called a Google Doodle where they change the banner. This particular one is about making music.
45:02
Let me tell you what it is before we do it. The idea is: famous composer Bach, Johann Sebastian Bach. They've taken all of his music and fed it into a data-processing, machine-learning engine that has learned the patterns, the style, of his music, from three hundred and sixty different songs. And then this is just something in the browser: I can type in some notes, I can type in a melody, and they're going to add some extra stuff to it to make it sound like it would have if Bach had composed it. So the flavor is, I can put in notes like this, I can put in a melody. But I'm pretty crap at that so I'm not going to do it; I'm going to pick one which is already there.
46:06
You can listen to it. So that's just Twinkle Twinkle Little Star; it's not Bach yet. And then the magic of what they're doing, so to speak: we can say how fast it is, we can perhaps change the key, and then we press this button and it's doing some calculation. And now we can listen to it. I think it's okay. All right, so essentially what it's done is add extra harmony to the melody, based on Bach.
47:17
You can play around even more and do the same sort of thing, and it's even better. So what comes out of this? This is cute, for sure, but what are some social things we can take away from this? Why is this perhaps interesting?
48:00
Tell me. ... Yeah, it puts generation in the hands of somebody like you or me, who is a total non-expert. It puts it in the browser, so to speak: I don't have any fancy software installed. It allows me to do some stuff. Very cute. I could do it for any composer, I could do it for any instrument; imagine the possibilities. So it's very interesting to think about. You might like to play around with it yourselves. All right, let's finish up there, on that note, so to speak, and have a good weekend.
50:07
Come otherwise streaming when you goting computation。Just like。