Hello Luvs, it's been a month since I've updated this blog. I like to write something once a week, but that doesn't seem to be quite possible: I decided to continue my studies in computer science, I've been working on our startup, and I'm busy living. Besides, the world isn't in its best shape, and we hear a big dose of sad news daily, which makes writing even harder.
Enough nagging, let's jump into it. If you've ever watched or read Batman, you know about his friend Alfred Pennyworth. Alfred handles everything in the shadows so Batman can be a shining star. Even though these characters are fictional, the idea of a loyal, tireless friend is perfectly valid in the real world.
Probably in the whole history of humanity, we have never been this close to creating phenomenal tools. Modern technology can offer you something that was marked as impossible just a few years ago, no matter what your profession is. It is so advanced that we can use it to create an Alfred! In this post, we'll try to build an early version of Alfred, which helps us with a sample task of scraping and categorizing some useful data.
Now, as this blog is mostly about infosec, we'll use an infosec case study for this post, but you can apply the idea to whatever you are doing.
Let's imagine I work as an offensive security engineer, which means I do penetration testing, red teaming, bug bounties, etc. What would my ideal Alfred do?
Technically speaking, it's possible to do everything listed here using machine learning, scraping, various APIs, and IoT devices. So if you own a restaurant or work as tech support for a hosting company, you can still have your own version of Alfred. You may ask: if it's literally possible to automate everything, why don't people do it?
1- Oh dear, it's easier said than done.
2- They are doing it, and they are talking about it. Tesla, for example: autonomous cars are automation at work; they automate driving.
3- They are doing it, and they don't talk about it. There are many examples I'd like you to think about. Hint? Bots.
Let's begin creating our own Alfred. We can't create a fully functional Alfred in a single post; we can barely touch the surface. So remember our hypothetical case study job? It starts with running at any time and gathering articles and resources in the related field.
Okay, for our infosec guy, this is the chosen list.
- Gather recent exploits, advisories, and write-ups; he needs to know about new vulnerabilities (sources: exploit-DB, GitHub Advisory, HackerOne Hacktivity, pentesterland write-ups)
- Gather news; he wants to know what's going on in the industry (sources: thehackernews.com, Reddit netsec, NewsAPI with filters on specific keywords: "0-day", "hacker", "data-breach", "bug-bounty", "vulnerability", "malware")
- Gather new jobs; well, he wants to know about trending jobs (source: infosec-jobs.com)
Okay, so how do we want to gather this data? By using a well-known technique called web scraping.
Scraping is one of the basic techniques used by most automation software, from SEO tools to fancy threat intelligence software. It's genuinely an art, because there are unlimited ways to scrape a particular piece of data. Working with multiple sources, you may have to use a different technique for each source. We are going to use Golang for our example, but you can use any programming language you like; I believe Python suits web scraping better than any other language.
Let's start with something easy: pentesterland.
Let's say we want the last ten write-ups.
...
c.OnHTML("#bug-bounty-writeups-published-in-2020", func(e *colly.HTMLElement) {
	e.DOM.Next().Find("a").Each(func(i int, selection *goquery.Selection) {
		link, _ := selection.Attr("href")
		title := selection.Text()
		// only look at the first few entries
		if i >= 18 {
			return
		}
		// filter out author/Twitter links so roughly ten write-ups remain
		if strings.Contains(selection.Text(), "@") || strings.Contains(link, "twitter") {
			return
		}
		pentesterLandArr[i] = []string{title, link}
		log.Println(title)
		log.Println(link)
...
What I've done here is find the element with the ID "#bug-bounty-writeups-published-in-2020", walk over every link after it, make sure each link isn't a Twitter link, and keep only the first ten results.
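By the way, the setup around that snippet is elided above. A minimal sketch of it, assuming the colly package and the pentesterLandArr map used above (the URL and function name are just placeholders for illustration), could look like this:

package main

import (
	"log"

	"github.com/gocolly/colly"
)

func scrapePentesterLand() map[int][]string {
	pentesterLandArr := make(map[int][]string)

	c := colly.NewCollector()
	c.OnHTML("#bug-bounty-writeups-published-in-2020", func(e *colly.HTMLElement) {
		// the goquery selection logic from the snippet above fills pentesterLandArr here
	})

	// placeholder URL: point it at the pentesterland write-ups page
	if err := c.Visit("https://pentester.land/list-of-bug-bounty-writeups.html"); err != nil {
		log.Println(err)
	}
	return pentesterLandArr
}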
The same goes for infosec-jobs.com, thehackernews.com, and github.com/advisory.
// infosec-jobs
c.OnHTML("#job-list", func(e *colly.HTMLElement) {
	e.DOM.Find("a").Each(func(i int, s *goquery.Selection) {
		// only the last 10 entries
		if i >= 10 {
			return
		}
		if s.Find("p").HasClass("job-list-item-company") {
			link, _ := s.Attr("href")
			title := s.Find("p").Next().Text()
			infoSecJobsArr[i] = []string{title, link}
		}
	})
})

// thehackernews
c.OnHTML("#Blog1", func(e *colly.HTMLElement) {
	e.DOM.Find(".story-link").Each(func(i int, s *goquery.Selection) {
		// only the last 10 entries
		if i >= 10 {
			return
		}
		link, _ := s.Attr("href")
		title := strings.TrimSpace(s.Find(".home-title").Text())
		hackerNewsArr[i] = []string{title, link}
	})
})

// Github Advisory
c.OnHTML(".Box", func(e *colly.HTMLElement) {
	e.DOM.Find(".Box-row").Each(func(i int, s *goquery.Selection) {
		// only the last 10 entries
		if i >= 10 {
			return
		}
		link, _ := s.First().Find("a").Attr("href")
		title := strings.TrimSpace(s.First().Find("a").Text())
		githubAdvisoryArr[i] = []string{link, title}
		log.Println(title)
	})
})
Just find the ID and the location of the links, and grab them. Now let's make the game more interesting. Let's say we want to scrape Reddit netsec items and we can't use our classic HTML parsing technique; what else can we do? We can always try to find another endpoint that gives us the information we want. In this case, I found an endpoint I didn't know existed.
It returns ten items in JSON from the subreddit I want. What else could I wish for? Nothing. Now there is an issue: if you try to curl the discovered endpoint, you will get {"message": "Too Many Requests", "error": 429}. Huh? When I refresh the endpoint in the browser, it still shows me the JSON. Can you guess what's going on here? A silly user-agent check. Let's bypass it.
....
client := &http.Client{}
// Didn't know such an endpoint exists
req, err := http.NewRequest("GET", "https://www.reddit.com/r/netsec/.json?count=10", nil)
if err != nil {
	log.Println(err)
	return nil, err
}
// well, we also need a bypass for the Reddit client check
req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36")
resp, err := client.Do(req)
if err != nil {
	log.Println(err)
	return nil, err
}
....
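Once the request succeeds, decoding the listing is the easy part. Here's a rough sketch of how I'd map it into title/link pairs; the struct below only covers the fields we need, you should double-check the exact shape of Reddit's listing JSON yourself, and redditArr is just a hypothetical result map like the other ones:

// minimal structs for the parts of the Reddit listing we care about (assumed shape)
type redditListing struct {
	Data struct {
		Children []struct {
			Data struct {
				Title     string `json:"title"`
				Permalink string `json:"permalink"`
			} `json:"data"`
		} `json:"children"`
	} `json:"data"`
}

// continuing after client.Do(req) succeeds; requires "encoding/json"
defer resp.Body.Close()
var listing redditListing
if err := json.NewDecoder(resp.Body).Decode(&listing); err != nil {
	return nil, err
}
redditArr := make(map[int][]string) // hypothetical result map, like the others
for i, child := range listing.Data.Children {
	redditArr[i] = []string{child.Data.Title, "https://www.reddit.com" + child.Data.Permalink}
}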
Let's move on to exploit-DB; this one is interesting because it uses JavaScript to create the table dynamically. In this case, we can't scrape it using a classic scraping technique. We have to either use a headless browser with a JS engine or find an endpoint like in the Reddit case. Here is my solution: we send a request to the draw endpoint, and to ensure it returns JSON, we make a fool out of it by setting the ("x-requested-with", "XMLHttpRequest") header, which spoofs AJAX behavior.
....
// url-encoded query is already filtered for 10 entities
dtQuery := "&columns%5B0%5D%5Bdata%5D=date_published&columns%5B0%5D%5Bname%5D=date_published&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=download&columns%5B1%5D%5Bname%5D=download&columns%5B1%5D%5Bsearchable%5D=false&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=application_md5&columns%5B2%5D%5Bname%5D=application_md5&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=verified&columns%5B3%5D%5Bname%5D=verified&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=description&columns%5B4%5D%5Bname%5D=description&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=false&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=type_id&columns%5B5%5D%5Bname%5D=type_id&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=false&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=platform_id&columns%5B6%5D%5Bname%5D=platform_id&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=false&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=author_id&columns%5B7%5D%5Bname%5D=author_id&columns%5B7%5D%5Bsearchable%5D=false&columns%5B7%5D%5Borderable%5D=false&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B8%5D%5Bdata%5D=code&columns%5B8%5D%5Bname%5D=code.code&columns%5B8%5D%5Bsearchable%5D=true&columns%5B8%5D%5Borderable%5D=true&columns%5B8%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B8%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B9%5D%5Bdata%5D=id&columns%5B9%5D%5Bname%5D=id&columns%5B9%5D%5Bsearchable%5D=false&columns%5B9%5D%5Borderable%5D=true&columns%5B9%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B9%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=9&order%5B0%5D%5Bdir%5D=desc&start=0&length=10"
client := &http.Client{}
// dataTables scraping technique
requestUrl := "https://www.exploit-db.com/?draw=1" + dtQuery
req, err := http.NewRequest("GET", requestUrl, nil)
if err != nil {
	log.Println(err)
}
// well, let's make a fool out of it
req.Header.Add("x-requested-with", "XMLHttpRequest")
resp, _ := client.Do(req)
....
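The draw endpoint answers with the standard DataTables server-side JSON: a draw counter, record counts, and a data array with one element per exploit row. Decoding it might look roughly like this; I leave each row as raw JSON because the exact row fields are something you should confirm against a real response:

// decode the DataTables-style response; rows are kept generic on purpose
var dtResp struct {
	Draw            int               `json:"draw"`
	RecordsTotal    int               `json:"recordsTotal"`
	RecordsFiltered int               `json:"recordsFiltered"`
	Data            []json.RawMessage `json:"data"`
}
defer resp.Body.Close()
if err := json.NewDecoder(resp.Body).Decode(&dtResp); err != nil {
	log.Println(err)
}
// every element of dtResp.Data is one exploit entry, keyed by the column names
// from the query string above (date_published, description, id, and so on)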
The same goes for HackerOne. We can't extract the data classically because the site responds differently to a requester without a JS engine. So again, we can use a headless browser or another endpoint. This time, let's use HackerOne's GraphQL endpoint to achieve our goal.
...
client := graphql.NewClient("https://hackerone.com/graphql")
// make a request
req := graphql.NewRequest(`
query HacktivityPageQuery($querystring: String, $orderBy: HacktivityItemOrderInput, $secureOrderBy: FiltersHacktivityItemFilterOrder, $where: FiltersHacktivityItemFilterInput, $maxShownVoters: Int) {
me {
id
__typename
}
hacktivity_items(last: 25, after: "MjU", query: $querystring, order_by: $orderBy, secure_order_by: $secureOrderBy, where: $where) {
total_count
...HacktivityList
__typename
}
}
fragment HacktivityList on HacktivityItemConnection {
total_count
pageInfo {
endCursor
hasNextPage
__typename
}
edges {
node {
... on HacktivityItemInterface {
id
databaseId: _id
...HacktivityItem
__typename
}
__typename
}
__typename
}
__typename
}
fragment HacktivityItem on HacktivityItemUnion {
type: __typename
... on HacktivityItemInterface {
id
votes {
total_count
__typename
}
voters: votes(last: $maxShownVoters) {
edges {
node {
id
user {
id
username
__typename
}
__typename
}
__typename
}
__typename
}
upvoted: upvoted_by_current_user
__typename
}
... on Undisclosed {
id
...HacktivityItemUndisclosed
__typename
}
... on Disclosed {
id
...HacktivityItemDisclosed
__typename
}
... on HackerPublished {
id
...HacktivityItemHackerPublished
__typename
}
}
fragment HacktivityItemUndisclosed on Undisclosed {
id
reporter {
id
username
...UserLinkWithMiniProfile
__typename
}
team {
handle
name
medium_profile_picture: profile_picture(size: medium)
url
id
...TeamLinkWithMiniProfile
__typename
}
latest_disclosable_action
latest_disclosable_activity_at
requires_view_privilege
total_awarded_amount
currency
__typename
}
fragment TeamLinkWithMiniProfile on Team {
id
handle
name
__typename
}
fragment UserLinkWithMiniProfile on User {
id
username
__typename
}
fragment HacktivityItemDisclosed on Disclosed {
id
reporter {
id
username
...UserLinkWithMiniProfile
__typename
}
team {
handle
name
medium_profile_picture: profile_picture(size: medium)
url
id
...TeamLinkWithMiniProfile
__typename
}
report {
id
title
substate
url
__typename
}
latest_disclosable_action
latest_disclosable_activity_at
total_awarded_amount
severity_rating
currency
__typename
}
fragment HacktivityItemHackerPublished on HackerPublished {
id
reporter {
id
username
...UserLinkWithMiniProfile
__typename
}
team {
id
handle
name
medium_profile_picture: profile_picture(size: medium)
url
...TeamLinkWithMiniProfile
__typename
}
report {
id
url
title
substate
__typename
}
latest_disclosable_activity_at
severity_rating
__typename
}
`)
...
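With the query in place, we only need to set the variables and run the request. Here's a rough sketch using the machinebox/graphql client (which is what the graphql.NewClient / graphql.NewRequest calls above look like they come from); the response struct is trimmed down to the report fields we actually need, and both its shape and the hackerOneArr map are assumptions:

// set the variable used by the query; the value here is an assumption
req.Var("maxShownVoters", 10)

// a trimmed response struct with only the fields we care about (assumed shape)
var respData struct {
	HacktivityItems struct {
		Edges []struct {
			Node struct {
				Report struct {
					Title string `json:"title"`
					URL   string `json:"url"`
				} `json:"report"`
			} `json:"node"`
		} `json:"edges"`
	} `json:"hacktivity_items"`
}

// requires "context"; hackerOneArr is a hypothetical result map like the others
if err := client.Run(context.Background(), req, &respData); err != nil {
	log.Println(err)
}
hackerOneArr := make(map[int][]string)
for i, edge := range respData.HacktivityItems.Edges {
	hackerOneArr[i] = []string{edge.Node.Report.Title, edge.Node.Report.URL}
}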
Finally, we can always use the actual API if one is available; it's far more robust than scraping. Let's do that with NewsAPI.org.
// selected keywords
keywords := []string{"0-day", "hacker", "data-breach", "bug-bounty", "vulnerability", "malware"}
newsAPIArr := make(map[int][]string)
counter := 0
var newsApi NewsApiResp
for i := 0; i < len(keywords); i++ {
	query := fmt.Sprintf("?qInTitle=%s&pagesize=%d&sortBy=publishedAt&language=en&apiKey=%s", keywords[i], pageSize, key)
	resp, err := http.Get(endPoint + query)
	if err != nil {
		log.Println(err)
		return nil, err
	}
	decoder := json.NewDecoder(resp.Body)
	err = decoder.Decode(&newsApi)
	resp.Body.Close()
	if err != nil {
		log.Println(err)
		return nil, err
	}
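For reference, NewsApiResp just needs to mirror NewsAPI's response, which wraps the results in an articles array. A minimal version covering only what we use might look like this (double-check the field names against the NewsAPI docs):

// minimal mapping of the NewsAPI response; only the fields we actually read
type NewsApiResp struct {
	Status       string `json:"status"`
	TotalResults int    `json:"totalResults"`
	Articles     []struct {
		Title       string `json:"title"`
		URL         string `json:"url"`
		PublishedAt string `json:"publishedAt"`
	} `json:"articles"`
}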
Now we've barely created our scraper code; we still need to store everything nicely together. Something like this will do:
// WriteNewsAPIToDB stores the scraped NewsAPI entries and returns how many were written.
func WriteNewsAPIToDB(newsArr map[int][]string, entity Entity, db *gorm.DB) (int, error) {
	totalFound := 0
	for _, item := range newsArr {
		entity.Title = item[0]
		entity.URL = item[1]
		entity.Source = "NewsAPI"
		if err := db.Create(&entity).Error; err != nil {
			log.Println(err)
		} else {
			totalFound++
		}
		entity.ID++
	}
	return totalFound, nil
}
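The Entity model and the database setup aren't shown above; a minimal sketch, assuming GORM v2 with the SQLite driver and made-up field names, could look like this:

import (
	"gorm.io/driver/sqlite"
	"gorm.io/gorm"
)

// Entity is one stored item; the exact field list here is an assumption
type Entity struct {
	ID     uint `gorm:"primaryKey"`
	Title  string
	URL    string
	Source string
}

func openDB() (*gorm.DB, error) {
	db, err := gorm.Open(sqlite.Open("alfred.db"), &gorm.Config{})
	if err != nil {
		return nil, err
	}
	// create the table on first run
	if err := db.AutoMigrate(&Entity{}); err != nil {
		return nil, err
	}
	return db, nil
}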
Let's wrap everything with a simple UI for one-click usage. We use fyne.io here for our UI, which is pretty awesome but still too young for production;
our Alfred at this stage is for personal usage, though, so we are doing fine. You may think: a GUI, seriously? Hey! This one is my Alfred, so my rules! :D What I'm trying to say is that you have limitless possibilities; otherwise, a GUI like this might not be that useful. Yours can be a command-line tool, come with a web UI, or even ship with hardware.
...
myApp := app.New()
myWindow := myApp.NewWindow("InfoSec Alfred")
// fun: greet the master in purple (alpha must be 255, or the text renders fully transparent)
greet := canvas.NewText("Hello master "+master.Name, color.RGBA{
	R: 189,
	G: 147,
	B: 249,
	A: 255,
})
centered := fyne.NewContainerWithLayout(layout.NewHBoxLayout(),
	layout.NewSpacer(), greet, layout.NewSpacer())
image := canvas.NewImageFromResource(resourceAlfredLgPng) // NewImageFromFile("./assets/alfred-lg.png")
myWindow.Resize(fyne.NewSize(300, 300))
image.FillMode = canvas.ImageFillOriginal
progress := widget.NewProgressBar()
progress.SetValue(0)
status := widget.NewLabel("Idle...")
statusContainer := fyne.NewContainerWithLayout(layout.NewHBoxLayout(),
	layout.NewSpacer(), status, layout.NewSpacer())
...
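The interesting part is wiring a button that kicks off the scrapers and feeds the progress bar and status label; roughly like this, where runAllScrapers is a hypothetical wrapper around the collectors we built earlier (using the same Fyne v1 calls as above):

// hypothetical: runAllScrapers wraps the scrapers above and reports progress (0..1)
run := widget.NewButton("Run", func() {
	go func() {
		status.SetText("Scraping...")
		runAllScrapers(func(done float64) {
			progress.SetValue(done)
		})
		status.SetText("Done, master.")
	}()
})

content := fyne.NewContainerWithLayout(layout.NewVBoxLayout(),
	centered, image, run, progress, statusContainer)
myWindow.SetContent(content)
myWindow.ShowAndRun()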
I know the coding parts might remind you of this photo:
but that's not the case here. The source code is available here.
Keep in mind that if you scrape data you aren't supposed to, you may get yourself into legal trouble, so be careful.
We've barely scratched the surface of both web scraping and automation.
The main idea of this post is to make you believe more in automation. That said, don't overdo it: automation is useful only when it makes sense to automate something. If you have a brilliant idea for automation,
it's your turn to create your own Alfred. Your Alfred can perform different tasks; it's all about what you need and what helps you. Don't worry if you can't code: if your plan is compelling enough, you can probably find software for it or outsource it. You can also always contact me if you need help building your stuff.
If you love the idea of having an Alfred and you work in InfoSec, make sure to check out HunterSuite, which is our sophisticated Alfred for offensive-security tasks. That's it for this post. Don't forget, the repo is here.
I also need to thank all of my lovely readers. I started this blog just because I felt an urge to talk and share, and now it has thousands of readers. Even though you barely speak to me directly (which you can!), I still appreciate your presence; it gives me the courage to write more.
0xSha