
I have a dataset with 2 variables (let's say x and y). x takes values from 1 to 1000000, and y is either 0 or 1.

I can plot a histogram of x with 100 bins using ggplot:

ggplot(data, aes(x=x))+geom_histogram(bins = 100)

However, I want each bar to be coloured according to the number of times y==1 within the range of that bar. An important property of the dataset is that y==1 occurs in less than 1% of the rows.

Can anyone please help me out?

Edit, to clarify what I mean: x is

[1]    2   10   10   10   12   18   33   35   38   42   44   46   59   60   64   69   69   73   74   76   78   83   84   84   85   88
  [27]   99   99  103  112  115  118  124  125  138  140  140  140  141  143  145  150  153  154  156  156  180  190  193  194  196  200
  [53]  205  209  221  225  227  230  231  234  237  239  241  244  248  256  259  259  260  266  267  273  280  282  283  284  288  290
  [79]  293  294  294  297  298  307  309  310  312  313  315  315  317  322  328  332  333  340  346  346  352  363  365  366  369  375
 [105]  378  380  382  384  386  387  399  403  403  406  411  425  427  427  433  439  441  442  443  446  448  450  453  457  459  460
 [131]  462  463  463  466  471  472  472  472  472  480  480  487  489  493  496  513  513  514  517  521  523  525  528  538  543  549
 [157]  550  550  551  564  566  581  588  592  600  605  610  610  614  614  623  628  629  631  642  646  646  648  651  654  654  656
 [183]  656  660  674  681  683  692  693  710  721  722  723  723  726  734  738  749  750  751  752  758  764  770  770  773  788  788
 [209]  790  795  804  804  805  809  810  811  821  823  830  862  862  866  868  869  874  881  890  892  899  905  907  908  909  909
 [235]  911  912  916  917  921  921  922  923  925  933  938  938  942  947  952  956  963  966  967  974  980  986 1000 1016 1023 1027
 [261] 1030 1034 1035 1036 1040 1052 1054 1055 1066 1071 1073 1074 1082 1082 1083 1084 1093 1097 1113 1114 1114 1117 1117 1120 1129 1132
 [287] 1138 1148 1152 1158 1161 1171 1174 1176 1177 1188 1201 1205 1206 1221 1227 1228 1230 1236 1238 1256 1259 1260 1261 1263 1264 1266
 [313] 1271 1272 1287 1290 1294 1295 1298 1303 1308 1317 1323 1324 1328 1332 1335 1340 1347 1352 1353 1354 1355 1356 1357 1363 1368 1379
 [339] 1380 1387 1396 1398 1399 1402 1403 1406 1410 1421 1421 1430 1432 1433 1434 1436 1443 1447 1459 1460 1464 1469 1471 1472 1474 1485
 [365] 1487 1488 1490 1494 1495 1496 1502 1502 1504 1506 1506 1518 1522 1526 1526 1531 1540 1548 1549 1552 1559 1562 1571 1573 1579 1580
 [391] 1582 1582 1587 1592 1599 1613 1619 1623 1631 1631 1634 1644 1655 1673 1673 1675 1681 1701 1704 1713 1719 1720 1720 1738 1757 1773
 [417] 1780 1784 1787 1793 1797 1801 1803 1812 1815 1817 1818 1820 1828 1832 1834 1835 1837 1839 1840 1840 1840 1842 1853 1870 1872 1873
 [443] 1873 1877 1881 1891 1895 1904 1906 1907 1926 1929 1937 1940 1947 1948 1951 1958 1982 1985 1993 1999 1999 2002 2012 2019 2023 2039
 [469] 2051 2054 2055 2057 2061 2061 2062 2086 2086 2090 2094 2095 2100 2103 2106 2106 2107 2108 2108 2108 2113 2113 2119 2125 2129 2148
 [495] 2154 2156 2156 2162 2165 2173 2184 2187 2189 2195 2208 2213 2213 2228 2242 2246 2269 2270 2270 2280 2280 2291 2292 2295 2301 2302
 [521] 2316 2319 2362 2368 2373 2397 2398 2400 2407 2416 2418 2421 2422 2423 2427 2428 2429 2430 2430 2431 2432 2435 2436 2437 2437 2440
 [547] 2440 2441 2441 2443 2466 2468 2469 2471 2471 2474 2477 2480 2483 2484 2494 2498 2500 2501 2519 2539 2542 2549 2550 2553 2565 2566
 [573] 2568 2573 2590 2601 2602 2604 2609 2614 2616 2618 2623 2640 2642 2645 2658 2663 2666 2669 2683 2690 2698 2699 2710 2714 2716 2718
 [599] 2718 2722 2731 2736 2742 2742 2743 2754 2757 2758 2777 2786 2790 2793 2793 2798 2800 2802 2805 2820 2829 2833 2834 2847 2853 2858
 [625] 2874 2890 2893 2895 2896 2904 2907 2908 2910 2912 2913 2914 2916 2919 2919 2920 2922 2922 2923 2924 2924 2925 2926 2927 2932 2935
 [651] 2938 2941 2942 2942 2949 2961 2975 2975 2984 2985 2993 2993 3006 3010 3017 3019 3021 3023 3037 3046 3047 3048 3049 3056 3056 3060
 [677] 3063 3066 3067 3068 3072 3081 3082 3083 3084 3086 3092 3102 3105 3106 3110 3110 3121 3122 3122 3135 3136 3142 3143 3143 3150 3152
 [703] 3154 3155 3157 3163 3186 3200 3222 3227 3228 3228 3232 3243 3248 3261 3261 3263 3270 3271 3276 3278 3308 3316 3317 3322 3327 3329
 [729] 3333 3345 3370 3373 3374 3374 3376 3381 3390 3405 3410 3423 3424 3436 3441 3464 3472 3483 3485 3493 3498 3512 3529 3533 3543 3562
 [755] 3583 3610 3617 3624 3626 3629 3630 3635 3636 3637 3637 3640 3646 3648 3662 3684 3686 3695 3697 3724 3726 3729 3734 3737 3738 3738
 [781] 3739 3740 3741 3745 3745 3745 3746 3746 3746 3746 3747 3747 3747 3748 3748 3748 3749 3749 3750 3750 3751 3751 3752 3752 3752 3752
 [807] 3753 3753 3753 3753 3754 3754 3754 3754 3754 3754 3755 3755 3755 3755 3755 3756 3756 3756 3756 3757 3757 3757 3757 3757 3758 3758
 [833] 3759 3760 3760 3760 3762 3763 3763 3765 3766 3767 3767 3767 3767 3769 3769 3770 3770 3770 3770 3770 3771 3772 3786 3794 3803 3803
 [859] 3810 3814 3819 3825 3826 3835 3838 3842 3851 3852 3854 3862 3865 3882 3889 3896 3915 3923 3947 3950 3960 3967 3969 3970 3971 3983
 [885] 3992 4015 4029 4048 4085 4105 4107 4118 4118 4129 4148 4153 4153 4168 4179 4182 4185 4209 4228 4230 4241 4245 4250 4267 4276 4280
 [911] 4280 4287 4299 4319 4322 4328 4329 4337 4350 4355 4363 4368 4387 4391 4395 4398 4402 4415 4422 4429 4433 4433 4442 4462 4466 4469
 [937] 4480 4485 4493 4496 4498 4519 4526 4528 4537 4540 4543 4549 4552 4553 4558 4558 4571 4578 4630 4636 4636 4636 4641 4648 4650 4662
 [963] 4690 4719 4729 4744 4747 4769 4771 4783 4787 4792 4827 4846 4855 4871 4871 4880 4894 4917 4933 4942 4956 4958 4963 4977 4983 4995
 [989] 5032 5037 5043 5093 5098 5102 5111 5112 5115 5137 5149 5155
 [ reached getOption("max.print") -- omitted 49247 entries ]

and y is:

   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [66] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [131] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [196] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [261] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [326] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [391] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [456] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [521] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [586] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [651] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [716] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [781] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [846] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [911] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [976] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [ reached getOption("max.print") -- omitted 49247 entries ]

The summary of x is:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2   54242   84428   94452  139052  172792 

and the table of y is:

    0     1 
49755   492 

ggplot(data, aes(x=x))+geom_histogram(bins = 100) gives me the following plot:

enter image description here

For each bar, I need the colour to change to a different shade of (for example) blue, based on the number of times y is 1 in the range represented by the bar.

edit2 :

library(ggplot2)
library(dplyr)
df <- data.frame(a = c(rep("T", 800), rep("F", 200)), b = round(runif(1000, min = 1, max = 100)))
df$c <- cut_number(df$b, n = 100)
df <- group_by(df, c) %>% mutate(ag = sum(a == "F"))
ggplot(df) + geom_bar(aes(x = c, fill = ag))

enter image description here

Could you explain what you mean by "y==1 in the range of that bar"? - NelsonGon
aes(x=x,fill=y) doesn't change the plot. - qwerty asdf
Yes, I misinterpreted your question. Check my new comment above. Thanks to @iod. - NelsonGon

1 Answer


You can't do that directly with geom_histogram(), so you have to build the bar chart yourself: break the data into the appropriate bins and assign each bin the number of times y==1 (a=="F" in the example below). You can then tweak the appearance a bit to make it look more like the default histogram.

Data:

df <- data.frame(a=c(rep("T", 800), rep("F", 200)), b=round(runif(1000, min=1, max=100)))

Create groups within df (binnum=number of bins):

library(ggplot2)  # ggplot(), geom_bar(), cut_interval()
library(dplyr)    # group_by(), mutate(), %>%

binnum <- 10
df$c <- cut_interval(df$b, n = binnum)  # split b into binnum equal-width bins
df <- group_by(df, c) %>% mutate(ag = sum(a == "F"))  # for each bin, count how many times a is "F"
ggplot(df) + geom_bar(aes(x = c, fill = ag))  # bar chart over the bins, filled by the number of times a is "F"

Result:

enter image description here
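
The same idea can be applied to the question's x and y by summarising the counts per bin before plotting. The following is only a minimal sketch, not the code used above: it assumes a data frame called data with a numeric x and a 0/1 y (simulated here, since the real data is not available), and the names breaks, binned, ones and mid are made up for illustration. Plotting the bin midpoints with geom_col() keeps the x axis numeric.

```r
library(ggplot2)
library(dplyr)

set.seed(1)
# Simulated stand-in for the asker's data (assumption: y is 1 in roughly 1% of rows)
data <- data.frame(x = sample(1:1000000, 50000, replace = TRUE),
                   y = rbinom(50000, 1, 0.01))

bins <- 100
# Equal-width bin edges spanning the range of x
breaks <- seq(min(data$x), max(data$x), length.out = bins + 1)

binned <- data %>%
  mutate(bin = cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)) %>%
  group_by(bin) %>%
  summarise(n = n(),               # bar height: rows in the bin
            ones = sum(y == 1)) %>% # fill value: rows in the bin with y == 1
  mutate(mid = (breaks[bin] + breaks[bin + 1]) / 2)  # bin midpoint for the x axis

p <- ggplot(binned, aes(x = mid, y = n, fill = ones)) +
  geom_col(width = diff(breaks)[1]) +                       # bars as wide as one bin
  scale_fill_gradient(low = "lightblue", high = "darkblue") # shades of blue
print(p)
```

Because the bars sit on a continuous axis here, the default numeric ticks are used and bins with no rows simply leave a gap.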

Now to tweak the appearance a bit:

To remove the gaps between the bars, add width=1 to your geom_bar(). To adjust the ticks, we can do this:

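binnum <- 50
df <- ungroup(df)
df$c <- cut_interval(df$b, n = binnum)  # re-bin so that c matches the new binnum
df <- group_by(df, c) %>% mutate(ag = sum(a == "F"))
df$d <- as.integer(df$c) * max(df$b) / binnum  # create a column with the approximate upper edge of each bin
ggplot(df) + geom_bar(aes(x = as.factor(d), fill = ag), width = 1) +
  scale_x_discrete(breaks = plyr::round_any(seq(1, max(df$d), length.out = 4), max(df$d) / binnum, f = ceiling))  # change length.out to the number of ticks you want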

Result:

enter image description here
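
If matching the tick values to factor levels by rounding turns out to be fragile, one alternative is to bin with explicit edges and label a few bins by their upper edge. This is a hedged sketch under the same toy data as above, not part of the original answer; edges, bin and tick_bins are illustrative names.

```r
library(ggplot2)
library(dplyr)

set.seed(42)
df <- data.frame(a = c(rep("T", 800), rep("F", 200)),
                 b = round(runif(1000, min = 1, max = 100)))

binnum <- 50
# Explicit equal-width bin edges, so every bin index maps to a known upper edge
edges <- seq(min(df$b), max(df$b), length.out = binnum + 1)
df$bin <- cut(df$b, breaks = edges, include.lowest = TRUE, labels = FALSE)

# For each bin, count how many times a is "F"
df <- df %>% group_by(bin) %>% mutate(ag = sum(a == "F")) %>% ungroup()

# Pick roughly evenly spaced bins to carry a tick (4 ticks here)
tick_bins <- unique(round(seq(1, binnum, length.out = 4)))

p <- ggplot(df) +
  geom_bar(aes(x = factor(bin, levels = 1:binnum), fill = ag), width = 1) +
  scale_x_discrete(breaks = as.character(tick_bins),                  # which bins get a tick
                   labels = as.character(round(edges[tick_bins + 1])), # upper edge of those bins
                   drop = FALSE)                                      # keep empty bins on the axis
print(p)
```

Here the tick positions are bin indices, so they always correspond to existing factor levels, and the labels are read from edges rather than recomputed.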