[WEB] robots.txt

Posted Apr 24, 2024 Updated Jun 24, 2024

By geunsuyoon

3 min read

This post explains robots.txt, one of the tools for improving SEO.

Rules

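The file must be named robots.txt.
A site can have only one robots.txt file.
The file must be located at the root of the website host.
- For example, to control crawling of all URLs under https://www.example.com/, the file must be at https://www.example.com/robots.txt.
  - Do not place it in a subdirectory!
A robots.txt file can be published on a subdomain or on a non-standard port.
- Subdomain example: ‘https://website.example.com/robots.txt’
- Non-standard port example: ‘https://example.com:8181/robots.txt’
A robots.txt file applies only to paths within the protocol, host, and port where it is published.
- For example, rules in ‘https://example.com/robots.txt’ do not apply to subdomains such as ‘https://m.example.com/’ or to alternate protocols such as ‘http://example.com/’.
The file must be a text file encoded in UTF-8 (which includes ASCII).
- Crawlers may ignore characters outside the UTF-8 range, which can invalidate the rules in robots.txt!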
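The scope rule above (one robots.txt per protocol, host, and port) can be sketched in Python. This is only an illustration: the helper name `robots_txt_url` and the page URLs are assumptions made up for this example, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Hypothetical helper: it keeps the scheme, host, and port of the page
    and replaces the path with /robots.txt at the host root.
    """
    parts = urlsplit(page_url)
    # netloc already contains the host and, if present, the port.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/shop/item?id=1"))
# https://www.example.com/robots.txt
print(robots_txt_url("https://example.com:8181/a/b"))
# https://example.com:8181/robots.txt
print(robots_txt_url("http://example.com/"))
# http://example.com/robots.txt
```

Each distinct result is a separate robots.txt file, which is why rules published under https do not carry over to http or to a subdomain.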

# This robots.txt file controls crawling of URLs under https://example.com.
# All crawlers are disallowed to crawl files in the "includes" directory, such
# as .css, .js, but Google needs them for rendering, so Googlebot is allowed
# to crawl them.
User-agent: *
Disallow: /includes/

User-agent: Googlebot
Allow: /includes/

Sitemap: https://example.com/sitemap.xml
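The example above can be tried out with Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the crawler name `SomeOtherBot` and the asset path `/includes/app.js` are assumptions made up for this example, and the standard-library parser is not Google's parser, so it may behave differently on edge cases.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example file, with comments omitted.
ROBOTS_TXT = """\
User-agent: *
Disallow: /includes/

User-agent: Googlebot
Allow: /includes/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical asset inside the "includes" directory.
asset = "https://example.com/includes/app.js"

googlebot_ok = parser.can_fetch("Googlebot", asset)    # uses the Googlebot group
other_ok = parser.can_fetch("SomeOtherBot", asset)     # falls back to the * group
sitemaps = parser.site_maps()                          # available in Python 3.8+

print(googlebot_ok, other_ok, sitemaps)
```

With these rules, Googlebot should be allowed to fetch the asset while any other crawler is blocked from it.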

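In a Sitemap line, ${URL} is the location of the website's sitemap.
It must be a fully qualified URL.
- Google does not assume or check alternates such as http/https or www/non-www variants.
- Write out the full URL wherever possible!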
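As a rough illustration of what "fully qualified" means here, the Python sketch below accepts only URLs that spell out both the protocol and the host. The function name `is_fully_qualified` is a hypothetical helper for this example, and the check is deliberately minimal.

```python
from urllib.parse import urlsplit


def is_fully_qualified(url: str) -> bool:
    """Return True if url has an explicit http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_fully_qualified("https://example.com/sitemap.xml"))  # True
print(is_fully_qualified("/sitemap.xml"))                     # False
print(is_fully_qualified("example.com/sitemap.xml"))          # False
```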
To block access to every page except the home page, write the following.

User-agent: *
Disallow: /
Allow: /$
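To see why these two rules leave only the home page open, here is a small hand-written matcher in Python. It is a simplified sketch, not a real crawler's implementation: it assumes the commonly documented behavior that `*` matches any sequence of characters, a trailing `$` anchors the end of the path, the longest matching pattern wins, and Allow wins a tie. The names `pattern_to_regex` and `is_allowed` are made up for this example.

```python
import re

# The rules from the snippet above, in (kind, pattern) form.
RULES = [("disallow", "/"), ("allow", "/$")]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes '.*'; everything else is matched literally.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path: str, rules=RULES) -> bool:
    """Apply longest-match evaluation; a path with no matching rule is allowed."""
    best_len = -1
    allowed = True
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            # Longer pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


print(is_allowed("/"))            # True
print(is_allowed("/about"))       # False
print(is_allowed("/index.html"))  # False
```

For the path `/`, both rules match, but `/$` is the longer pattern, so Allow wins. For any other path, only `Disallow: /` matches.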

This post is licensed under CC BY 4.0 by the author.